Abstract

A widely accepted approach to the formal modelling of learning is to
view it as a process of constructing high-level structures that allow
compression of raw data. The main results of this thesis are a number of precise
formulations of this process, and a few algorithms that carry it out.

Our rigorous definitions allow us to identify a novel concept of structure:
the alphabet. We propose that a learning mechanism is one which
constructs such structures as representations of the given data. The alphabet
allows us to draw interesting links between theories in pattern recognition
and pattern classification.

The structural descriptions constructed by the algorithms are based
on the two concepts of code and classification, used as data compression
tools. We provide means for evaluating the efficiency of the encodings
created  using  ideas  from  information  theory.  We  reduce  learning  to  an 
optimization  problem  and  we  suggest  that  the  mechanisms  proposed  work 
at  any  descriptive  level. 

Our  approach  was  inspired  by  the  goal  of  interpreting  learning  as  the 
natural  evolution  of  a  physical  system  to  its  ground  states. 

The  applications  discussed  include  an  architecture  and  a  learning  rule 
for  neural  nets  as  well  as  a  procedure  for  grammatical  inference.  Some  of 
the  ideas  proposed  have  been  tested  on  meaningful  examples. 

"It is an old idea that the more pointedly and logically we formulate
a thesis, the more irresistibly it cries out for its antithesis"


Hermann  Hesse 
The  Glass  Bead  Game 


Chapter  1 


The  Problem 


1.1      Introduction 

The  story  starts  one  day  in  Olympus  when  Zeus  has  a  headache.  He  calls  up 
Hephaestus, the hard-working locksmith god, who splits his head in half and out
comes Athena, the goddess of Reason and War. That day Hephaestus invents
divide&conquer, reason, and war all at once. Since then mankind has been using
them to put Zeus' head back together and reconstruct the big picture.

This  thesis  is  about  learning  and  inductive  inference.  It  is  the  result  of  three 
years  of  work  on  the  subject  with  the  goal  of  arriving  at  a  general  understanding  of 
the  concepts  in  its  attempt  to  reconstruct  the  big  picture  for  the  fragmented  reality 
of  the  community  interested  in  those  subjects.  To  do  this,  it  had  to  be  "general", 
i.e.,  "applicable  to  the  widest  range  of  contexts  and  situations".  Nonetheless  it 
aimed  at  a  formal  comprehension:  the  task  was  to  be  considered  complete  when  the 
understanding  reached  a  mathematical  definition  of  the  terms  involved,  and  when 
the guidelines for a computer implementation seemed sufficiently solid.

In  evaluating  it,  the  reader  should  consider  that  learning  theory  yields  problems 
and  results  which  shed  light  on  many  fundamental  topics  of  today's  scientific  quest. 
Inductive inference is a paradigm to model several processes like the search for
physical laws, Popper's logic of discovery [6], the adaptation of a biological being
to its environment [66], intelligence, evolution. We hope the implications that its
results may have will contribute to keeping the interest in learning theory alive.

Pragmatic  Applications 

During my career as a student I have been dealing with computers in many
ways. Though I have spent great efforts in adapting to their and their appendages'
needs, I feel that they have made no effort to do the same with me. After using the
same few operating system commands and addressing the same few files over and
over again I still get a "not found" answer when I misspell it by a single unmeaningful
letter. After several years my faithful terminal has not learned yet that my finger
always slips on a "k" and types a "j". Computer software and hardware exhibit
much too little flexibility to make them really user friendly, make the man-machine
interaction pleasant, and promote a wide diffusion among inexperienced users. It
is a pity. Computer technology has reached a degree of complexity suitable for
nontrivial application in everyday life.

The ultimate goal of the research in learning is to breathe life into computers in the
form of flexibility and adaptability. We only need to observe primitive life forms to
be convinced that adaptive behavior is possible at any level of functional complexity,
and the search never ends. This thesis is a contribution toward bringing that goal closer.

A large amount of the work spent in any software agency -academic, commercial,
or public- is for software maintenance and updating. This work could be saved in
part if it were possible to write evolving, self-updating programs.

In Programming Languages key design criteria are writability, readability and
reliability. The picture is clear: we want them to look like natural languages.
Moreover, when a mechanism developed by research in computational linguistics is reliable
enough, it can be employed in fields like Programming Languages where reliability is
essential. Natural language is believed to be the major medium by which a culture
bequeaths its achievements, what it has learned. The knowledge of the laws which
govern this mechanism is likely to be of interest to both Programming Languages
and Computational Linguistics.

Transmission of information, communication, is a key element present in many
research fields. It interests computer technology at every level. The Von Neumann
style computer is a device that moves around data and performs simple operations;
then information is moved through several levels of memory and peripherals. The
increasing power of computer systems has made it possible to network systems all
around the world. In general, theoretical and practical issues in parallel computing
are concerned with making the information that a processor has reached a certain
state available to other processors.

Efficient, reliable communication and standardization are vital
to all these fields. They will benefit greatly from data about the means of
communication that mankind has been perfecting over the past few tens of thousands of years.
It is of great interest to computer science and technology to gain knowledge about
the laws by which information is stored in language and about the reason why
languages defined ad hoc like Esperanto are difficult to use and not as efficient as
languages that evolved naturally [84].

1.2      This  Thesis 

Before  trying  to  explain  what  was  achieved  in  the  present  thesis,  we  need  to  consider 
what  is  meant  by  the  term  "learning"  and  why  the  concept  is  often  used  in  close 
connection with "inductive inference".

The  relations  between  learning  and  inductive  inference  have  been  effectively 
pointed  out  by  Angluin  and  Smith  [1].  The  common  sense  concept  of  "learning" 
is  "to  gain  knowledge"  or,  in  other  words,  to  acquire  data.  "Inductive  inference" 
denotes  the  process  of  hypothesizing  a  general  rule  from  examples. 

Learning  as  Efficient  Acquisition  of  Data 

It is clear then how inductive inference is useful for learning: a general rule is
usually a more efficient representation for a corpus of data. We get to our first
characterization of the concept of "learning" as considered in this dissertation:
acquiring data in an efficient way. What "acquiring data" means should be sufficiently
clear to a computer scientist; however, "in an efficient way" requires some further
attention. Part of this thesis contributes to a clarification of this idea by exploiting
basic concepts from Information Theory where, for several decades, researchers have
studied the efficiency of data transmission.

Our ultimate aim is to provide the guidelines for the construction of a learning
system that obeys laws of spontaneous convergence towards more efficient operation.
We start our task at the level of perception: the system transforms the sensory
data into more efficient representations. The short term construction of complex
representation is understanding. The long term construction of the means to obtain
a complex representation is the process of learning.

Efficiency is defined as minimal cost where cost is any appropriate function. Our
exposition does not depend on a particular choice of the cost function (except that
it has to satisfy natural monotonicity constraints); it lends itself to many different
interpretations. A suitable cost function is size, memory space; in this situation
"efficient" is synonymous with "short". The process of finding a more efficient
representation can be justified by the necessity to fit a larger and larger amount of data
in a small memory space. Another appropriate cost function is time; "efficient" is
synonymous with "quick", and finding a more efficient representation can be justified
by the necessity for real time computation.

Learning  as  the  Search  for  Structural  Descriptions 

Nevertheless we emphasize that, while learning, one acquires data with the
purpose of using it in the future, in toto or in chunks. Supposedly a corpus of examples
contains some relevant pieces of information that one would like to extract. Our
goal is not only a compressing algorithm, like Ziv and Lempel's [86] which already
shows many optimality characteristics; what we desire is a way to find a structural
description of the data. We get to the second point covered by this thesis: what is
"structure" and what are the operations needed to discover it.

We will discover two operations to make a representation more efficient: block
coding and class coding. While block coding is a textbook technique in information
theory, we are not aware of any previous proposal of class coding as an optimization
technique. Their description in this thesis is kept to the bare essentials, but
they are the object of two related research fields: Pattern Recognition and Pattern
Classification.
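
As a rough illustration of the first operation, the following Python sketch applies a textbook form of block coding: a recurring block of symbols is replaced by a single new symbol, and the codebook that records the substitution is charged to the total cost. It is not the definition developed later in this thesis; the names `block_encode`, `block_decode` and `total_size`, the example text, and the choice of size as the cost function are assumptions made only for this example.

```python
from typing import Dict


def block_encode(text: str, codebook: Dict[str, str]) -> str:
    """Replace each block listed in the codebook by its single new symbol."""
    for block, symbol in codebook.items():
        # The new symbol must be fresh, otherwise decoding would be ambiguous.
        assert symbol not in text, "code symbol already occurs in the text"
        text = text.replace(block, symbol)
    return text


def block_decode(encoded: str, codebook: Dict[str, str]) -> str:
    """Undo block_encode by expanding symbols in reverse order of insertion."""
    for block, symbol in reversed(list(codebook.items())):
        encoded = encoded.replace(symbol, block)
    return encoded


def total_size(encoded: str, codebook: Dict[str, str]) -> int:
    """Size cost (assumed): encoded text plus every codebook entry."""
    return len(encoded) + sum(len(b) + len(s) for b, s in codebook.items())


text = "abcabcabcabc"            # hypothetical raw data
codebook = {"abc": "X"}          # hypothetical block and its new symbol
encoded = block_encode(text, codebook)
cost_before = len(text)
cost_after = total_size(encoded, codebook)
```

In this toy case the encoded text together with its codebook is shorter than the raw text, which is the sense in which the new representation is more efficient when the cost function is size. Class coding is not sketched here because its definition is given only later in the thesis.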

By  means  of  these  two  operations  we  will  be  able  to  construct  structures  in  the 
form  of  what  we  call  interpretational  schemes  in  a  context,  or  alphabets.  We  will 
find  strong  evidence  that  this  formalization  of  structure  is  as  expressive  as  other 
established formalizations, e.g. grammars, programs or Turing machines.

Learning  as  the  Construction  of  a  Semantics 

The great interest that computational linguists show for the parse tree of
sentences is due to the belief that with it they can retrieve the meaning of the sentence.
This is true in general: structural description and meaning are closely related. We
are led to consider a third definition of learning: constructing a semantics for the
data under observation. This definition is a little more difficult to grasp, especially
because "meaning" and "semantics" are not yet completely understood concepts.
We  can  get  an  informal  intuition  of  the  idea:  when  children  learn  their  first  language 
they  attach  a  meaning  to  a  few  strings  of  phonemes  and  incrementally  proceed  to 
discover  the  meaning  of  longer  and  longer  utterances.  In  this  dissertation  there  is  an 
attempt  to  formalize  the  idea  of  "semantics"  and  hence  to  clarify  what  the  phrase 
"to  construct  a  semantics"  means. 

The embryonic results achieved are sufficiently clear to give a definite feeling
for what the semantics is, how it relates to reference and in what way
the semantics of a symbol depends on its context. We hope that the formalization
proposed will show why the construction of a structural description is in fact the same as
the construction of a semantics and will capture the reader's common sense idea.

Learning  as  the  Construction  of  an  Efficient  Language 

Children, when learning their mother tongue, construct their own language,
which should coincide with the language of their community. However, the language
constructed is one's own: there are cases of twins who develop a language which is
not that of their community. We have pointed out that learning has many relations
with inductive inference and the theory of inductive inference hinges upon the notion
of language, as we will see in the historic introduction to the problem. Languages
are in one-to-one correspondence with classes of grammars, or classes of any other
computational devices like programs, Turing Machines, etc. Informally speaking,
as a result of this one-to-one correspondence, we can identify the notion of language
with that of computational device. If we take the learning set to be a text, the
problem of inductive inference becomes that of constructing a language for that text
(we allow pluralism and will not say "the language of that text"). Then, learning
becomes constructing an efficient language for a given text.

The development of this point reduces to pointing out equivalences between
well established notions. For this reason we will ground it on experiments with real
linguistic data and on the exposition of examples of the concepts constructed by
other known ideas that they want to model.

This dissertation will clarify why the various definitions for learning given above
do, in fact, coincide. The knowledge that acquiring data efficiently, finding
structural descriptions, constructing a semantics, constructing a language, all describe
"learning", can be very useful in finding relations and applications in many different
fields of science. For instance, in Section 3.3 we will consider an attempt to bridge
the gap between the so called "symbolic" and "subsymbolic" learning. Our belief
is that the difference is only in the way of expressing the ideas and that there exists
a thorough coincidence in the underlying mechanism.

1.3     A  Bird's  Eye  View 

The present chapter proceeds with a historical introduction to problems in different
fields addressed by this thesis, so as to point to possible applications of the results
achieved. However, no knowledge in those fields is required as prerequisite for the
exposition in later chapters. The theory presented is self-contained. Some examples,
however, might require further external information.

The introductory chapter is followed by three main chapters. These smaller
subunits are also reasonably self-contained: they don't depend on each other except for
some definitions and notational conventions and they can be read independently.
However the thesis will be fully appreciated only when the true connections between
the three main chapters are drawn.

Chapter 2 is devoted to the construction of a rigorous language so as to give a
concrete meaning to the several working definitions for learning given in the
statement of our purpose. All the expressions used will be given a precise formalization
on the basis of a few elementary notions.

The  mathematical  approach  that  we  embraced  in  introducing  our  new  language 
gives  another  advantage  besides  rigor:  the  terms  involved  should  be  taken  with  no 
preconceived  interpretation  but  only  in  their  explicit  mathematical  relations.  The 
reader is only supposed to understand, not also to believe. Nonetheless we will use
familiar  names  to  favor  the  interpretation  that  helps  in  understanding. 

The  entire  dissertation  hinges  upon  the  notion  of  alphabet,  so  the  main  aim  of 
Chapter  2  is  to  provide  a  rather  general  definition  of  alphabet  and  to  show  that 
it  is  grounded  on  the  notions  of  code  and  classification.  Nonetheless  some  results 
presented  are  interesting  in  themselves.  The  first  is  Theorem  1  which  governs  the 
proper  interaction  between  patterns  and  classifications.  The  second  is  Theorem  2 
which  shows  that  the  structure  constructed  is  as  powerful  as  any  other  computa- 
tional structure. 

As  we  have  pointed  out,  it  is  not  mandatory  to  read  Chapter  2  in  order  to 
understand  the  following  Chapter  3  and  Chapter  4.  The  reader  can  do  without  it 
and take alphabet as the generalized notion of the familiar concept as it is done in
the  theory  of  free  semigroups. 

Chapter  3  deals  with  the  notion  of  measurement  which  is  a  fundamental  concept 
for  any  system  trying  to  gain  knowledge.  Once  again  ideas  will  be  studied  in  their 
full  generality.  The  main  aim  of  this  chapter  is  to  clarify  the  notion  of  efficiency  with 
respect  to  a  given  cost  function.  In  doing  so  we  will  be  able  to  recognize  that  there 
exists  a  gain  in  efficiency  when  defining  classification  but  only  if  the  classification 
is  considered  jointly  with  a  set  of  patterns,  i.e.  a  code.  In  this  chapter  we  will  draw 
connections  between  the  ideas  constructed  with  the  notions  of  energy,  entropy  and 
temperature  and  we  will  be  able  to  look  at  learning  as  a  process  of  evolution. 

Chapter 4 is devoted to pragmatics and will contain general learning algorithms
based on the notions of previous chapters. We will introduce the technique of
hierarchical search in order to improve the running time of the obvious enumerative
algorithm. The main idea set forth points to the necessity to introduce cost
functions, like running time and accuracy, as parameters to the learning algorithms
constructed. We will build up algorithms which are feasibly implementable.

1.4     Historical  Motivations 

There is always a temptation to explain a general idea without giving any example
in order not to freeze the reader's imagination on a particular theme. We will
overcome this temptation and ground the exposition on the explanation of the
problems and the results that, historically, have motivated much research in the
field. The following sections provide the necessary points of connection with the
fields of inductive inference, algorithmic information theory, pattern classification,
hierarchical systems, and language acquisition.

1.4.1     Inductive  Inference 

Enumeration 

The first step in many scientific enterprises is to translate its problems into
a formalized language. Our goal is to construct a mechanism which provides a
structural description for the reality under observation and these concepts have been
formalized in many equivalent ways. We adopt the one presented below.

The  reality  for  which  we  want  to  find  a  structural  description  (or  still  better,  a 
general  rule)  is  presented  as  a  sequence  of  symbols,  i.e.  in  the  form  of  a  text  T.  Let 
us  assign  a  number  to  each  symbol,  then  the  text  becomes  a  sequence  of  numbers. 
Now  we  have  to  make  a  distinction  and  decide  whether  we  want  a  general  rule  for 
the sequence as an ordered set or simply as an unordered set; in other words, we
have to decide whether the sequence is one big instance or is a set of examples. This
distinction might bother us if we think back to the general problem, but, under our
model, the two different cases can be treated essentially in the same way, with minor
differences in details.

With no loss of generality (at least in principle) we can formulate the problem
in a more circumscribed domain of recursive function theory as it has been done by
Gold [38] and Blum and Blum [6]. The sequence is a recursively enumerable set and
the description for that set is a partial recursive function. The solution was given
in [6]: for several, quite general classes of recursively enumerable sets it is possible to
construct an algorithm which yields such a partial recursive function by looking at
one element at a time. The algorithm is based on the enumeration technique whose
general idea runs as follows: enumerate all the partial recursive functions and check
if they halt on all the elements seen so far. Eventually the right one is hit. Of
course things don't always go smoothly; one needs a little care in dovetailing and
in considering only partial recursive functions for which the predicate $[\varphi_i(x) = y]$
is  decidable.  See  [6]  Example  2. 

It is now easy to stretch the concepts and take the description of a sequence
-any kind of symbols- to be any computational model that generates that sequence
as an output. Then the enumeration technique can be applied also to different
domains like grammatical inference, as it has been used in [5], whose system proceeds
to construct a grammar by successively adding rules that are consistent with the
positive examples observed.

Unfortunately the framework of identification by enumeration raises two sorts
of concerns. The first follows the observation that there are, in general, many
partial  recursive  functions  which  identify  the  same  set  as  well  as  many  grammars 
which  identify  the  same  language.  The  second  concerns  the  high  computational 
cost  of  an  enumeration. 

Information  Complexity 

Recursive functions, grammars, or programs as described above, are
representations for the given text. We have pointed out that we aim at efficient representations^1
so we also need to formalize the notion of "best". The "best" description for a text
(among all the possible ones) is achieved by considering what is "optimal with
respect to a given cost function". So if the cost function is length, then the best
description will be the one whose length is the same as the Kolmogorov complexity^2
K of the sequence. Unfortunately Kolmogorov complexity is not computable [87]
and that creates a problem in using the concept of minimal description [52,70,71].
One of the motivations for the present thesis is the necessity to use another well
established information complexity, Shannon entropy, which bounds K from above
([86] Theorem 5.1). Shannon entropy, or any other entropy-like complexity [18], is
the natural complexity measure for the alphabet, the model of structure used in this
thesis.

^1 This is obviously just a standpoint motivated by the trend common to much scientific work and
discovered in many natural events, summarized by Whitehead's law of universal laziness.
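
Unlike K, an entropy can be computed directly from the text. The Python sketch below computes the empirical Shannon entropy of a text with respect to the simplest possible scheme, in which every distinct character is a symbol. This choice of scheme and the function name `shannon_entropy` are assumptions of the example; the thesis measures entropy relative to an alphabet defined in later chapters.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Empirical Shannon entropy of the character distribution, in bits per symbol."""
    n = len(text)
    if n == 0:
        return 0.0  # convention for the empty text (an assumption of this sketch)
    counts = Counter(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Hypothetical texts: two equiprobable symbols, then four equiprobable symbols.
h_two = shannon_entropy("aabb")
h_four = shannon_entropy("abcd")
h_one = shannon_entropy("aaaa")
```

Multiplying the entropy by the length of the text gives an estimate, in bits, of the cost of the text under this single-character scheme. A text made of one repeated symbol has entropy zero, while a text that uses four symbols equally often needs two bits per symbol.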

The  relations  between  Kolmogorov  complexity  and  Shannon  entropy  have  been 
studied  from  several  standpoints  [19,20,29,31,55,87].  We  found  the  use  of  Shannon 
entropy  interesting  because  our  situation  allows  one  to  change  the  scheme  with 
respect  to  which  complexity  is  computed.  Relative  entropy  is  more  easily  computed 
in  this  way  because  we  don't  need  to  go  through  the  universal  scheme. 

Continuity 

In order to understand why the enumerative solution presented above is not
effectively workable we should make the following observation. For the sake of our
reasoning let us choose programs as the computational model; analogous considerations
can be made with any other choice. Enumeration of programs is based on
the choice of a conventional Gödel numbering [30] which is constructed by
considering the exterior appearance of the program. It is an intrinsic property of Gödel
numbering that programs which have a similar behavior and similar exterior
appearance can be mapped to distant numbers. Let us try to illustrate this phenomenon
with a different example. Rational numbers are denumerable but there exists no
enumeration that preserves their natural order. So, if one is asked to find a rational
within a given distance ε from a given irrational number, and, as the only means of
producing numbers, one is given an enumeration of the rationals, then he will never
be able to use the information that a particular choice was close. He will have to
stick with the enumeration and wait.

^2 Kolmogorov complexity is defined just as the length of the shortest program that computes the
sequence with respect to a universal computing scheme [51,87,19,20]. A sequence is structured if its
Kolmogorov complexity is smaller than its length.
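
The rational-number example can be made concrete with a short Python sketch. The particular enumeration (positive fractions in lowest terms, ordered by numerator plus denominator) and the names `enumerate_rationals` and `first_within` are assumptions chosen for illustration; any other enumeration would make the same point.

```python
import math
from fractions import Fraction
from itertools import count
from typing import Iterator, Tuple


def enumerate_rationals() -> Iterator[Fraction]:
    """Enumerate the positive rationals p/q in lowest terms by increasing p + q."""
    for total in count(2):
        for p in range(1, total):
            q = total - p
            if math.gcd(p, q) == 1:
                yield Fraction(p, q)


def first_within(target: float, eps: float) -> Tuple[int, Fraction]:
    """Walk the enumeration until some rational lies within eps of target.

    The search is blind: a candidate that comes close is simply discarded,
    because the enumeration order carries no information about closeness.
    """
    for index, r in enumerate(enumerate_rationals()):
        if abs(float(r) - target) < eps:
            return index, r
    raise AssertionError("unreachable: the enumeration is infinite")


index, approx = first_within(math.sqrt(2), 0.01)
```

With target √2 and ε = 0.01 the search passes 3/2 and 7/5, both fairly close, without being able to exploit them, and stops only when the enumeration happens to reach 17/12.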

This thesis is an attempt to render the representation of the computational
model, of its input, and of its output, close enough in appearance so that a small
change in the input would cause a small change in the output for a given instance
of the model (program), and such that a small change in the program would cause
a small change in the output for a given input. This result is achieved with the
introduction of the alphabet as a computational model. However, we should
emphasize that the present work is just a first step towards the goal of identifying the
intuitive notion of continuity as explained above. Its rigorous understanding looks
by far a more difficult task in our irredeemably discrete domain. This problem has
also been addressed in the different domain of lambda calculus and it is discussed
in [4].

Computational  Complexity 

The  model  that  we  introduce  allows  one  to  change  the  enumeration  once  he 
has  got  close  to  the  desired  result  so  as  to  adjust  the  range  of  variability  of  the 
enumerative  process  and  acquire  efficiency  in  computation.  As  we  will  see  this  brings 
about  a  qualitative  speedup  of  the  convergence  toward  the  solution.  Our  approach 
is close to that of Solomonoff [81] and Rissanen [69,70,71,72] who suggested that
the index of the enumeration can be combined with a measure of the complexity of
the sequence to get a good description. The same intuition has been used also by
De Santis et al. [33] for probabilistic prediction functions. The measure selected by
our  approach  is  the  entropy  of  the  text  relative  to  the  description. 

The  delicate  point  of  high  computational  cost  involved  with  many  solutions  of 
learning  problems  has  been  addressed  most  properly  by  Valiant  [83]  with  a  definition 
of  what  is  learnable  in  terms  of  what  is  achievable  with  an  algorithm  which  runs  in 
time  polynomial  in  the  accuracy  parameter  h  and  in  the  various  size  parameters  of 
the  program  to  be  learned.  Our  model  is  quite  different  from  the  one  in  [83]  but  it 
is  not  difficult  to  find  the  right  relations.  The  algorithms  that  we  present  do  have 
a  parameter  which  corresponds  to  accuracy. 

1.4.2     Pattern  Recognition  and  Pattern  Classification

Pattern recognition and pattern classification are probably the oldest fields of
research closely related to problems of learning. The idea of pattern is used
throughout the present work in its limited formulation as ordered n-tuple of symbols, hence
synonymously with string.

Many  different  classifier  systems  exist  which  employ  different  standpoints.  See 
[35]  for  a  good  introduction.  Classifiers  are  systems  that  try  to  sort  a  set  of  instances 
according to a preconceived set of features.

This  thesis  addresses  the  following  two  problems,  concerning  patterns  and  classes 
in a fundamental way:

1.  We  are  trying  to  model  learning  so  in  some  way  we  need  to  do  both  clas- 
sification and  recognition  of  characteristic  patterns.  So,  what  are  the  good 
patterns  and  what  are  the  good  classifications  to  look  for,  i.e.  to  construct, 
in  order  to  have  a  productive  interaction  between  the  two?  In  other  words, 
what  are  the  conditions  that  sets  of  patterns  and  classifications  must  satisfy 
so  that  one  can  effectively  apply  these  two  operations  again  to  get  a  further 
improvement? 

2.  How  does  classification  fit  in  our  scheme  of  learning  as  efficient  memorization? 

Theorem  1  gives  an  answer  to  the  first  question.  The  second  question  is  answered 
in  Section  3.1.3. 

1.4.3     Hierarchical  Systems 

The idea to construct hierarchies of alphabets with the goal of inductive analysis of
a text was first employed in [9,10]^3 which also provide a procedure, the Procrustes
algorithm, for constructing a higher level code for the text. A similar standpoint
led to the excision procedure [42] independently used by the Linguistic String
Project [74,39] for constructing their natural language grammar.

^3 In this dissertation we use the concept of alphabet in a more general way. Our code is what in
[9] is called alphabet.

The Procrustes vein of research led to the conception of hierarchical systems
[8] as optimizing self-organizing systems. A great deal of experimentation [11,65]
has shown that the theory of hierarchical systems describes the behavior of many
natural systems, like monetary systems, human population systems, and natural
languages.

The results with natural language give heuristic evidence that the choice
of where to draw level subdivisions is arbitrary. This evidence has motivated the
present research into a general homogeneous approach to all levels of language, along the
guidelines of [9]. A similar standpoint is taken by stratificational linguistics: in [58]
we can find evidence for the claim, also supported in this thesis (see Theorem
2), that a lot of linguistic facts can be expressed with structures constructed with
strings and sets.

1.4.4     Language Acquisition and Natural Language Processing

This  section  provides  a  more  detailed  introduction  to  the  fields  where  the  results  in 
this  thesis  find  immediate  pragmatic  application. 

Our  problem  in  this  domain  is  language  acquisition.  It  can  be  formulated  in 
the  following  way:  how  to  build  a  device  that  learns  the  language  it  is  exposed  to, 
and  what  is  the  a  priori  machinery  that  is  needed  to  do  this.  This  problem,  with 
this simple formulation, has brought about endless discussion; [67,68] give a good
glimpse of it.

Discussion is often caused by misunderstanding or, better, different understandings
of the arguments involved. In the given formulation of the problem there are
many terms which allow multiple interpretations, but we don't want to get into
philosophical arguments. Let us simply suppose that the aim is to construct a device
which will acquire knowledge about language in the form that is needed by a
natural language processing system. So we need to know what a language processing
system is. The following general description of a language processing system is
intended for the nonspecialist. The expert should bear with its oversimplifications.

    <S>     →  <SUBJ> <VERB> <OBJ>
    <SUBJ>  →  <NSTG>
    <PN>    →  P <NSTG>
    <NSTG>  →  <LNR>
    <LNR>   →  <LN> N <RN>
    <LN>    →  <TPOS> <APOS>
    <TPOS>  →  T | null
    <APOS>  →  ADJ | null
    <RN>    →  <PN> | null
    <VERB>  →  <LTVR>
    <LTVR>  →  <LV> TV <RV>
    <LV>    →  D | null
    <RV>    →  D | <PN> | null
    <LVR>   →  <LV> V <RV>
    <OBJ>   →  <NSTG> | <TOVO> | null
    <TOVO>  →  to <LVR> <OBJ>

Figure 1.1: A Natural Language Grammar

Most current systems consist of three parts: a syntactic, a semantic, and a discourse
analyzer⁴. This subdivision already gives an important piece of information:
syntactic, semantic, and discourse analysis are believed to rely mainly on different
mechanisms.


Syntax  Analysis 

Syntax analysis is performed with the use of a grammar and a dictionary embedded
in the system. An example of a natural language grammar is given in Figure 1.1,
which is a simplified version of the one given in [39].

Symbols outside of angular brackets, the terminals of the grammar, are word
classes and should be read as


ADJ: adjective       D: adverb       N: noun
P: preposition       T: article      TV: tensed verb
V: untensed verb

⁴ This is in fact the chapter organization of [39].

The  following  is  an  example  of  a  dictionary. 

blue: ADJ
cheese: N
John: N
like: ADJ, P, TV, V
likes: TV
Mary: N
with: P


When the system is presented with a sentence, say "Mary likes blue cheese",
the system searches the dictionary and finds the word class attached to each word
in the sentence. Then it replaces each word by its syntactic class and tries to parse
the sentence with the grammar by constructing a parse tree. Note that there might
be more than one choice available in several steps of the process. This leads to the
problem of finding the "right" choices or, in other words, the right parse tree. The
solution is usually pursued in the direction of strengthening the grammar. It can
be achieved by imposing additional constraints, like agreement in number between
verb and subject and agreement in attributes found in the dictionary. However,
in principle, strengthening the grammar can always be achieved with additional
grammar rules.
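
The lookup-then-parse procedure just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the parser of [39]: the rule table is our reading of Figure 1.1, and the names GRAMMAR, DICTIONARY, ends, class_sequences and parsable_sequences are hypothetical. The sketch enumerates every assignment of word classes allowed by the dictionary and keeps those that the grammar accepts as a sentence.

```python
from itertools import product

# Our reading of Figure 1.1; an empty right-hand side stands for "null".
GRAMMAR = {
    "S": [["SUBJ", "VERB", "OBJ"]],
    "SUBJ": [["NSTG"]],
    "PN": [["P", "NSTG"]],
    "NSTG": [["LNR"]],
    "LNR": [["LN", "N", "RN"]],
    "LN": [["TPOS", "APOS"]],
    "TPOS": [["T"], []],
    "APOS": [["ADJ"], []],
    "RN": [["PN"], []],
    "VERB": [["LTVR"]],
    "LTVR": [["LV", "TV", "RV"]],
    "LV": [["D"], []],
    "RV": [["D"], ["PN"], []],
    "LVR": [["LV", "V", "RV"]],
    "OBJ": [["NSTG"], ["TOVO"], []],
    "TOVO": [["to", "LVR", "OBJ"]],
}

# The example dictionary: each word lists its possible word classes.
DICTIONARY = {
    "blue": ["ADJ"], "cheese": ["N"], "John": ["N"],
    "like": ["ADJ", "P", "TV", "V"], "likes": ["TV"],
    "Mary": ["N"], "with": ["P"],
}


def ends(symbol, classes, start):
    """Yield every position at which `symbol` can end if it begins at `start`."""
    if symbol not in GRAMMAR:  # terminal: must match the class at `start`
        if start < len(classes) and classes[start] == symbol:
            yield start + 1
        return
    for rhs in GRAMMAR[symbol]:
        positions = {start}
        for part in rhs:  # thread the reachable positions through the rule
            positions = {e for p in positions for e in ends(part, classes, p)}
        yield from positions


def class_sequences(sentence):
    """All word-class sequences the dictionary allows for the sentence."""
    choices = (DICTIONARY[word] for word in sentence.split())
    return [list(seq) for seq in product(*choices)]


def parsable_sequences(sentence):
    """Class sequences that the grammar derives from S in their entirety."""
    return [seq for seq in class_sequences(sentence)
            if len(seq) in set(ends("S", seq, 0))]
```

For "John like Mary" the dictionary offers four class sequences, one per reading of "like", and only the one that reads it as a tensed verb is accepted by the grammar as we transcribed it. This is the choice problem mentioned in the text, in miniature.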

We can now ask a few questions which lead to the arguments that motivated
this research. First of all, why is the dictionary not incorporated in the grammar?
In fact, one could decide that word classes are variables, and dictionary entries
terminals of the grammar. Second, many of the restrictions [39] (the constraints
that guide the applicability of a rule) concern situations that are usually indicated
in the suffixes of the words, like number or tense. So, why is the dictionary, i.e.
the set of the terminal symbols, not pushed to the higher resolutions of roots and
suffixes or even to the level of letters? The previous two questions can be put in
the following way: what guided the apparently arbitrary choice of the level at
which to freeze the terminals of the grammar?

The best reply to these inquiries is that, though arbitrary, the choice was dictated
by convenience. For historical reasons, information about the dictionary the
way it was chosen was readily available [...].
Moreover, choosing the terminals at a lower level would [...] the
computational complexity of the parsing procedure. Last, but not least, [...]
applications are concerned with written language where there exists [...],
namely the space, which indicates the end of a word: there is a natural ending.

This answer leads to a new question: what are the mechanisms that guided the
history of research in mathematical and computational linguistics to concentrate
on this level? Is it just because that is about the right level of computational
complexity?

Obviously much information is lost by drawing this apparently arbitrary level
line, so most serious systems try to integrate the other levels. Is there a way to
treat all the levels homogeneously?

Semantic  Analysis 

The second step that a natural language processing system takes is semantic
analysis. It aims at attaching a meaning to the tree structure constructed during the
syntactical analysis. Stated in this way, the problem looks hopeless. What is meaning,
where is it located, and how can we retrieve it from the structural description are
difficult questions which are not very well understood. For this reason researchers
describe it as the translation of the sentence into a formal language which
should "be unambiguous, have simple rules of interpretation and inference, and in
particular have a logical structure determined by the form of the sentence." [39]

The formal language usually chosen is that of logic. The side effect of its nice
properties quoted earlier is that it imposes a certain literal-mindedness on the
system. Often the literal interpretation of an utterance has very little to do with the
intended one. These considerations, in fact, led to a vein of research [2,27] stemming
from the Speech Acts approach to philosophy of language [3,76]. Is it in the vein of
contradiction to ask the question: why should we translate a sentence into logical
form and not treat it the same way human beings treat it? Do they treat it with
logic?

A final answer to these questions requires an adequate theory of language
understanding. Unfortunately we still lack one. In this situation one cannot help looking
for a solution to the difficult questions concerning meaning.

In this thesis we will embrace the idea that meaning is a global, distributed
property of a given expression. We call this the holographic hypothesis. It can be
exemplified in many ways by considering obvious characteristics of language like
redundancy. Consider for example "W r ntrstd n rdndnc". The same trick can be
played at every level of language, not only that of letters. It can be done with
syllables, words, phrases, sentences, periods, ..., provided the context is appropriately
enlarged.

Discourse  Analysis 

We have introduced enough redundancy so that there is no need for the
explanation that discourse analysis is the attempt to find the interconnections between
different sentences in a text, once again, to retrieve its meaning. However, we can't
help asking the same question: why is it considered different from the previous
steps? Why are the tools and the theories that are used different⁵?

Let us go back to our language acquisition problem. The aim is to construct
a device which will acquire knowledge in the form that is needed by a natural
language processing system. As we have seen, currently such a system is composed of quite
differentiated subsystems, so the language acquisition task is different depending
on what subsystem one is thinking about. As a matter of fact, different works on
language acquisition, starting from different assumptions, use different methods and
achieve different goals. As an example consider [5] and [77]; both claim to address
the language acquisition problem but their results and methods are very far apart.

⁵ Actually there have been indications that discourse analysis can be achieved with the same
means as syntax analysis, with the text grammar approach, but they are not very effective and other
methods are more fashionable.

This thesis is based on the understanding that the various steps of natural language
processing, as currently approached, are instances of the same problem of
finding the interconnecting relations between sub-units that form larger constructs.
The following are different instantiations of the previous sentence obtained by
assigning more specific words to the general concepts units and larger constructs.

1.  find  the  interconnecting  relations  between  words  in  a  sentence; 

2.  find  the  interconnecting  relations  between  word  meanings  forming  a  sentence 
meaning; 

3. find the interconnecting relations between sentences in a discourse.

The general problem of language acquisition becomes that of constructing the
machinery for obtaining these relations. In general, the problem of learning is that
of constructing a machinery that permits the discovery of the relations between
sub-units in a larger structure. The present thesis gives a precise formulation and
a workable solution to the general problem.


Chapter  2 


Semiotics 


The language introduced in this chapter aims at describing the concepts and the
situations which are encountered in semiotics, in general, and linguistics, in particular.
We define the important notions of semantics, syntax, interpretational scheme and
context in such a way that the definitions emphasize their interrelations. Then, we
introduce two structure constructors to point out the connection between structure
and semantics. We will see how structure permits representation and how the two
constructors interact to create a unique structural concept, the alphabet.

We introduce a computational paradigm and we prove that it is as powerful as
any other established computational paradigm. This is evidence
that the two constructors are sufficient to create any computable structure.

The proposed theory will be rigorous only up to the point when its structure
is solid enough to compare the notions involved with other related ideas. We will
see that the intuition that we have built is in harmony with other, different
characterizations of the concepts under consideration. It is not by chance that rigor
is synonymous with stiffness; therefore the use of the rule to define and work out
everything is not employed to the extreme. The more we proceed in the study,
the easier it will be to introduce new concepts and statements. We will avoid all the
tedious details when they could be easily reconstructed on the basis of the context
and of previous assertions. "The last dotting of the last i, in the manner of the old
fashioned Cours d'Analyse in general and Bourbaki in particular, gives satisfaction
to the author who understands it anyway (...); for most serious-minded readers it is
worse than useless." [41].

2.1      Symbols  and  Symbolic  Forms 

The key idea is once again to split in halves, to impose a distinction where apparently
there is none; the concepts that we separate are "sign" and "symbol". We call signs
symbolic forms, in order to free the reader's imagination in interpreting. Symbols
and symbolic forms are the elementary notions of the theory. Unless otherwise
stated, the universe of symbols will be disjoint from the universe of symbolic forms.

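Let $\mathcal{U}$ be a universe of symbols and $\mathcal{L}$ a universe of symbolic forms. A constructor
$\omega$ is a function $\omega : \mathcal{U}^* \to \mathcal{L}$, where $\mathcal{U}^*$ is the universal language over $\mathcal{U}$, i.e., the
free semigroup over $\mathcal{U}$.

Given a set of symbols $\mathcal{U}$ and a family $\Omega$ of constructors, we can conceive of a
symbolic form over $\mathcal{U}$ as the result of the application of an $\omega \in \Omega$ on arguments in
$\mathcal{U}$. Let $\mathcal{L}_n(\mathcal{U}) = \{\omega(s_1,\dots,s_n) \mid \omega \in \Omega,\ s_i \in \mathcal{U}\}$ and
$\mathcal{L}^+(\mathcal{U}) = \bigcup_{n=1}^{\infty} \mathcal{L}_n(\mathcal{U})$. $\mathcal{L}^+(\mathcal{U})$ is the
set of all the possible symbolic forms on the symbols of $\mathcal{U}$ and will be called the
hyperuranios of $\mathcal{U}$.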

Constructors  should  be  thought  of  as  all  the  functions  that  allow  the  formation 
of  complex  symbolic  structures  from  an  ordered  set  of  symbols.  This  justifies  the 
name  hyperuranios;  with  it  Plato  denoted  the  space  beyond  the  celestial  spheres 
where  Ideas  reside. 

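Example 1   1. Let $\Omega$ contain only the string constructor $[\,.\,]$, then $\mathcal{L}^+(\mathcal{U})$ is the
free concatenation (or the universal language) over $\mathcal{U}$, $\mathcal{L}^+(\mathcal{U}) = \mathcal{U}^*$.

2. Let $\Omega$ contain only the set constructor $\{\,.\,\}$, then $\mathcal{L}^+(\mathcal{U})$ is the powerset of $\mathcal{U}$,
$\mathcal{L}^+(\mathcal{U}) = \mathcal{P}(\mathcal{U})$.

□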
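
The two constructors of Example 1 can be sketched in Python to make the definition of $\mathcal{L}_n(\mathcal{U})$ concrete. This is a minimal sketch, under the assumption that a string form is modelled by a tuple and a set form by a frozenset; the names string_former, set_former, forms_of_arity and hyperuranios are hypothetical. Since $\mathcal{L}^+(\mathcal{U})$ is infinite for the string constructor, only the slice with arity at most max_arity is enumerated.

```python
from itertools import product


def string_former(*symbols):
    """String constructor [.]: the order and multiplicity of arguments matter."""
    return tuple(symbols)


def set_former(*symbols):
    """Set constructor {.}: order and repetitions are forgotten."""
    return frozenset(symbols)


def forms_of_arity(universe, constructors, n):
    """L_n(U): every omega(s_1, ..., s_n), omega in constructors, s_i in universe."""
    return {omega(*args)
            for omega in constructors
            for args in product(universe, repeat=n)}


def hyperuranios(universe, constructors, max_arity):
    """Finite slice of L+(U): the union of L_n(U) for n = 1, ..., max_arity."""
    forms = set()
    for n in range(1, max_arity + 1):
        forms |= forms_of_arity(universe, constructors, n)
    return forms
```

Over a two-symbol universe and arity at most two, the string constructor alone gives six forms (two of length one, four of length two), while the set constructor alone gives only the three nonempty subsets.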

Example 2   (i) Let $\mathcal{U}$ contain the symbol 0 and $\Omega$ the only element $\omega$. Let us
suppose that $\omega$ is injective. Then $\mathcal{L}^+(\mathcal{U})$ contains infinitely many symbolic forms,
namely $\omega(0), \omega(0,0), \dots$. Under these hypotheses we cannot say more about $\mathcal{U}$ or
$\mathcal{L}^+(\mathcal{U})$.

(ii) However, if we make the additional supposition, as many mathematicians
do, that $\mathcal{L}^+(\mathcal{U}) \subseteq \mathcal{U}$, then we can easily prove that $\mathcal{U}$ as well contains at least
countably many elements.

□

2.1.1      The  Semantics 

We have assumed that the $\omega$'s take only symbols as arguments, so if we want
to apply them recursively on the new symbolic forms created, we have to associate a
symbol $s \in \mathcal{U}$ to every symbolic form $\omega(s_1, s_2, \dots)$, $s_i \in \mathcal{U}$, by means of a map $\phi$. In this
way any symbolic form can serve as a symbol itself.

$\phi$ will be given the name atomizing function to suggest the idea that it is a
function that allows one to look at a complex symbolic construction as one single atomic
symbol.

The alternative name semantics¹ is justified by the holographic hypothesis:
meaning is a global property of a given expression.

Definition 1 An atomizing function for $\mathcal{L}^+(\mathcal{U})$, or simply a semantics, is a map

$$\phi : \mathcal{L}^+(\mathcal{U}) \to \mathcal{U}$$

satisfying the property:

($\phi 1$)   $\phi(\omega(s)) = s$ for each $\omega \in \Omega$, $s \in \mathcal{U}$.

The semantics is faithful for $\Omega$ if it is one to one (except, obviously, for elements in
$\mathcal{L}_1(\mathcal{U})$: by property $\phi 1$ it can be one to one there only if $\Omega$ is a singleton set). The triple
$(\mathcal{U}, \Omega, \phi)$ is a semantic universe.

□
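
A toy atomizing function may help fix the idea. The sketch below is an illustration under our own assumptions, not a construction taken from the text: string forms are modelled as tuples and set forms as frozensets, and every form with more than one element receives a freshly generated symbol name of the shape "#k", which we assume does not already occur in the universe. The class name AtomizingFunction is hypothetical. Property $\phi 1$ holds by construction, and the map is one to one on forms that are not in $\mathcal{L}_1(\mathcal{U})$.

```python
class AtomizingFunction:
    """A toy semantics phi: maps each symbolic form to a single symbol."""

    def __init__(self):
        self._names = {}  # symbolic form -> symbol that names it

    def __call__(self, form):
        # Property phi-1: a form built from one symbol s is mapped back to s.
        if len(form) == 1:
            (symbol,) = form
            return symbol
        # Any other form gets its own fresh symbol, so distinct forms never
        # share a name (the "#k" naming scheme is an assumption of this sketch).
        if form not in self._names:
            self._names[form] = f"#{len(self._names)}"
        return self._names[form]
```

Because the value returned is itself a symbol, it can be fed back into a constructor: phi((phi(("a", "b")), "c")) names a string whose first component is the atomized string "a b".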


¹ The etymology of the Greek word sima, sign, symbol, roots it in the Indo-European dhyāmn and
the Sanskrit dhyāna, thought.

[...] one symbol it is not possible to express, or write down, anything [...]
symbol itself.

[...] we can be operating only [...] universe of symbols. Therefore, no matter
what the operations in $\Omega$ do, our only [...] is that the semantics returns in
$\mathcal{U}$. Therefore a faithful semantics [...] behavior of the
$\omega$'s. Moreover, the hyperuranios of a universe is [...] by means of
the class $\Omega$, so we can interpret the semantics $\phi$ as the [...]
in $\Omega$. By means of $\phi$ we can see the $\omega$'s in $\Omega$ as operations [...].
The following examples should clarify the matter.

Example 3 Let $\mathcal{U}$ contain (at least) two symbols, 0 and 1, and let $\omega \in \Omega$. Then $\omega$ is the
operation of set formation if

$$\phi(\omega(s_1,\dots,s_n)) = \phi(\omega(t_1,\dots,t_m)) \iff \{s_1,\dots,s_n\} = \{t_1,\dots,t_m\} \qquad (2.1)$$

If we identify the set $\{0, 1, \dots, n-1\}$ with the number $n$, $\mathcal{U}$ contains all the natural
numbers and (almost all) the finite subsets of the natural numbers.

Let us emphasize that condition $\phi 1$ makes sure that the set containing only
one symbol is identified with the symbol itself. [...] in the
situation of Example 2 (i), still we could not say anything more than $0 \in \mathcal{U}$. The
qualitative jump is given by the presence of two elements in $\mathcal{U}$.

[...] the universe.
□


Example 4 A context-free grammar with set of [...]
defines a semantic universe [...]. The
special symbol † is read as meaningless, S is [...].
In order to see it we need the following

Definition 2 A pruned tree of a tree $t$ is a tree obtained from $t$ by
pruning off some subtrees but leaving their roots.

D 

Example 5 [Figure: a tree $t$ and a pruned tree of $t$.]

Consider a context free grammar $G$. Let $\Omega$ contain only the operations of string
formation and set formation.

For all the elements $s \in \mathcal{L}^+(\mathcal{U})$ define $\phi(s)$ as follows:

•  if $s$ is composed of a single symbol then $\phi(s) = s$;

•  if $s$ is a string and can be read off the leaves of some pruned subtrees $u_1, \dots, u_n$
of some derivation trees $t_1, \dots, t_m$ of $G$, then put the additional new symbols
$w_{u_1}, \dots, w_{u_n}, w$ in $\mathcal{U}$. Let $\phi(w_{u_i}) = \mathrm{root}(u_i)$ and $\phi(\{w_{u_1}, \dots, w_{u_n}\}) = w$. $\phi(s) = w$;

•  otherwise $\phi(s) = \dagger$.


Example 6 Let $(\mathcal{U}, \Omega, \phi)$ be the semantic universe associated as in Example 4 to
the grammar $G$ for natural language given in Section 1.4.4. Denote with $Z$ the set
{SUBJ, OBJ, VERB} and with $\mathcal{S}$ the set {S}. Then

1. The semantics of SUBJ VERB OBJ is S;

2. The semantics of T N is NSTG;

3. The semantics of TV is TV.


There are many possible ways to show that model theoretic semantics is captured
by our definition. In the following example we will work out the case of Propositional
Calculus and Predicate Calculus.

Example 7 As a first step let us consider Propositional Calculus; we will see how to
construct a semantic universe from a set $V$ of propositional variables. Let
$\mathcal{U} = V \cup \{\Rightarrow, F, T, \dagger\}$, and let $\Omega$ contain only the string formation. In this situation
$\mathcal{L}^+(\mathcal{U}) = \mathcal{U}^*$. A model theoretic semantics $\phi$ is a truth value assignment consistent with the obvious
laws of logic². Ill formed formulas are assigned the value †. So
$\phi : \mathcal{L}^+(\mathcal{U}) \to \{F, T, \dagger\} \subseteq \mathcal{U}$.

Let us now consider Predicate Calculus. Given a set $C$ of constants, a set $P$ of
predicates, a set $V$ of variables we define $\mathcal{U} = C \cup P \cup V \cup \{F, T, \Rightarrow, \forall, \dagger\}$, $\Omega$, and
$\phi$ as above.
□

² The set $\{\Rightarrow, F, T\}$ is just a possible minimal set of symbols for Propositional Calculus.

2.1.2      The Syntax

A  widespread  belief  that  motivated  this  research  is  that  syntax  and  semantics  can 
be  classified  as  different  things.  This  belief  is  somewhat  misleading.  Syntax  and 
semantics  are  different,  but  they  also  bear  a  lot  of  connections:  in  this  section 
we  define  syntax  as  the  inverse  function  of  semantics.  This  makes  it  possible  to 
understand  why  learning  can  be  explained  both  as  the  construction  of  a  semantics 
and  as  the  construction  of  a  structural  description. 

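Let $(\mathcal{U}, \Omega, \phi)$ be a given semantic universe. Consider the equivalence relation $\sim$
on $\mathcal{L}^+(\mathcal{U})$:

$$x \sim y \iff \phi(x) = \phi(y) \qquad (2.2)$$

and denote with $\mathcal{L}(\mathcal{U})$ the quotient space

$$\mathcal{L}(\mathcal{U}) = \mathcal{L}^+(\mathcal{U}) / \sim$$

Then there exists a unique bijective map $\hat{\phi}$ such that the following diagram is
commutative.

$$\mathcal{L}^+(\mathcal{U}) \xrightarrow{\ \mathrm{nat}\ } \mathcal{L}(\mathcal{U}) \xrightarrow{\ \hat{\phi}\ } \mathcal{U}, \qquad \phi = \hat{\phi} \circ \mathrm{nat}$$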

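Definition 3 Let $(\mathcal{U}, \Omega, \phi)$ be a semantic universe. Since $\hat{\phi}$ is a bijection, we
can consider the bijective map $\psi = \hat{\phi}^{-1}$, $\psi : \mathcal{U} \to \mathcal{L}(\mathcal{U})$. $\psi$ is the reference function, or
simply the syntax of $\mathcal{U}$.

Let $\Sigma \subseteq \mathcal{U}$. The function $\psi_\Sigma : \mathcal{U} \to \mathcal{L}(\Sigma)$ defined as $\psi_\Sigma(x) = \psi(x) \cap \mathcal{L}(\Sigma)$ is the
reference function of $\mathcal{U}$ restricted to $\Sigma$.

Let $\Sigma_0, \Sigma_1 \subseteq \mathcal{U}$. $\Sigma_0$ is sufficient for $\Sigma_1$ if $\psi_{\Sigma_0}(s) \neq \emptyset$ for each $s \in \Sigma_1$.

□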
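
For a finite fragment of a semantics the reference function can be computed by simply inverting the table of $\phi$. The following Python sketch illustrates this under our own assumptions: string forms are tuples, the table passed to syntax is a hand-picked finite fragment, and the names syntax, restricted and is_sufficient are hypothetical.

```python
from collections import defaultdict


def syntax(phi_table):
    """Invert a finite semantics: symbol -> set of forms that phi maps to it."""
    psi = defaultdict(set)
    for form, symbol in phi_table.items():
        psi[symbol].add(form)
    return psi


def restricted(psi, symbol, sigma):
    """psi restricted to sigma: keep only the forms written with symbols of sigma."""
    return {form for form in psi.get(symbol, set()) if set(form) <= sigma}


def is_sufficient(psi, sigma0, sigma1):
    """sigma0 is sufficient for sigma1 if every s in sigma1 has a form over sigma0."""
    return all(restricted(psi, s, sigma0) for s in sigma1)


# A hand-picked fragment of a semantics for the grammar of Section 1.4.4.
PHI = {
    ("SUBJ", "VERB", "OBJ"): "S",
    ("N", "TV"): "S",
    ("N", "TV", "N"): "S",
    ("T", "N"): "NSTG",
    ("N",): "N", ("TV",): "TV", ("T",): "T",
    ("SUBJ",): "SUBJ", ("VERB",): "VERB", ("OBJ",): "OBJ",
}
```

With Z = {"SUBJ", "VERB", "OBJ"}, restricted(syntax(PHI), "S", Z) contains the single form SUBJ VERB OBJ, so Z is sufficient for {"S"}; it is not sufficient for a set containing "N", because no form over Z has N as its semantics in this fragment.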

Example 8 Let $(\mathcal{U}, \Omega, \phi)$ be the semantic universe associated as in Example 4 to
the grammar $G$ given in Section 1.4.4. Denote with $Z$ the set {SUBJ, OBJ, VERB}
and with $\mathcal{S}$ the set {S}. Then

1. The syntax of NSTG restricted to $W$ is
$\psi_W(\mathrm{NSTG}) = \{$T N, N P N, T ADJ N, ...$\}$;

2. The syntax of S restricted to $Z$ is $\psi_Z(\mathrm{S}) = \{$SUBJ VERB OBJ$\}$;

3. The syntax of S restricted to $W$ is $\psi_W(\mathrm{S}) = \{$N TV, N TV N, T N TV, ...$\}$;

4. $W$ is sufficient for $\mathcal{U}$;

5. $Z$ is not sufficient for $\mathcal{U}$ but is sufficient for $\mathcal{S}$.


2.1.3     The  Interpretational  Scheme 

It is a widespread convention to call any set of symbols an alphabet³. We have
seen in the previous section that if a semantics is assigned to our universe then every
symbol of an alphabet has attached to it a class of symbolic forms. We will see now
how the context permits us to select the relevant ones.

Definition 4 An interpretational scheme $\mathcal{A}$, or simply a scheme, is an ordered pair
of alphabets $\mathcal{A} = (X, \overline{X})$ in a semantic universe, $X, \overline{X} \subseteq \mathcal{U}$. $\overline{X}$ is the context of
$\mathcal{A}$ ($\overline{X}$ might have no relation with $X$ in general. We choose the overline notation
because it is suggestive of the operations that we will introduce in the following,
with which it is consistent).

We will say that $\mathcal{A}$ is over $\overline{X}$ and we will call $\psi_{\overline{X}}(X)$ the support set for $\mathcal{A}$.

An element $s \in X$ is atomic if $\psi_{\overline{X}}(s) = \{s\}$. A scheme is atomic if all its elements
are atomic.

³ [...] informally. Later on we will find a way to formalize it in
accordance with the commonly used concept.


Definition 5 A semantics is faithful for the scheme $\mathcal{A} = (X, \overline{X})$ if for each $x \in X$,
$\psi_{\overline{X}}(x)$ contains exactly one element.

Usually all the schemes that we consider live in the given semantic universe.
When the semantics is faithful for the scheme considered, we will say that the
scheme is faithful.

□

Example 9 Consider the semantic universe associated to the natural language
grammar defined in Example 6. Then $\phi$ is faithful for the scheme $(\mathcal{S}, Z)$, but $\phi$ is
not faithful for the scheme $(\mathcal{U}, W)$.

The schemes $(W, W)$ and $(Z, Z)$ are atomic.
□

Example 10 Let $X = \{1, 2, 3, \dots\}$ be the set of positive natural numbers, $X_1 =
\{I, V, X, L, C, D, M\}$ the set of the roman numerals, and $X_2 = \{0, 1\}$. Let $\Omega$ contain
the operation of string formation, then:

$\psi_{X_1}(1) = I$          $\psi_{X_1}(2) = II$          $\psi_{X_1}(3) = III$

$\psi_{X_2}(1) = 1$          $\psi_{X_2}(2) = 10$          $\psi_{X_2}(3) = 11$

$\psi_{X_1 \cup X_2}(4) = \{IV, 100\}$    $\psi_{X_1 \cup X_2}(5) = \{V, 101\}$    $\psi_{X_1 \cup X_2}(6) = \{VI, 110\}$    ...

□
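
The reference functions of Example 10 can be computed directly. The Python sketch below is only an illustration: it assumes the standard seven roman numerals with subtractive notation and the usual binary positional notation, and the names roman, binary and psi_union are hypothetical.

```python
# Value/numeral pairs for standard roman notation, largest first.
ROMAN = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"), (100, "C"),
         (90, "XC"), (50, "L"), (40, "XL"), (10, "X"), (9, "IX"),
         (5, "V"), (4, "IV"), (1, "I")]


def roman(n):
    """Reference function over the roman alphabet, for 1 <= n <= 3999."""
    parts = []
    for value, numeral in ROMAN:
        count, n = divmod(n, value)
        parts.append(numeral * count)
    return "".join(parts)


def binary(n):
    """Reference function over the alphabet {0, 1}."""
    return format(n, "b")


def psi_union(n):
    """Forms naming n when the context is the union of the two alphabets."""
    return {roman(n), binary(n)}
```

Over either alphabet alone each number has exactly one form, so the scheme is faithful; over the union the same symbol has two forms, which is the situation displayed in the last row of the example.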


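Contexts can be redundant; it might happen, in fact, that $\psi_{\overline{X}}(X) \subseteq \mathcal{L}(X_0)$ with
$X_0 \subset \overline{X}$. So we might need the function $\Lambda$ which takes an interpretational scheme
$\mathcal{A}$ as argument and returns the smallest context which $\mathcal{A}$ is over:

$$\Lambda(\mathcal{A}) = \bigcap_{\psi_{\overline{X}}(X) \subseteq \mathcal{L}(X')} X'$$

$\Lambda(\mathcal{A})$ is called the alphabet of $\mathcal{A}$.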

Remark 1 Often a scheme $\mathcal{A} = (X, \overline{X})$ is identified with the set $X$. This entails
that we overload the set theoretical symbols $\in, \subseteq, \cup, \dots$, and their associated concepts.
If $\mathcal{A} = (X, \overline{X})$ is a scheme then $s \in \mathcal{A}$ means $s \in X$ if $s$ is a symbol and $s \in \psi(X)$ if
$s \in \mathcal{L}(\overline{X})$. Analogous situations hold for the other set theoretical symbols.

□


2.2      Strings  and  Sets 

We have introduced the notion of semantics and we have seen that it governs the
result of the [...] in $\Omega$, namely the operations [...] is faithful. In this
section we assume that the [...] and that the same situation can be expressed by
saying that $\Omega$ contains only the set-former and string-former and that the
semantics is faithful. In fact, in the exposition to follow, we will take the second
viewpoint.

In the sequel we will see how the structures constructed with the set-former and
the string-former can be used as means of representation. The main result of this
section is Theorem 1. It gives the condition under which set formation and string
formation interact to create the more general structure of alphabet. The notion of
alphabet that will be introduced generalizes the commonly intended concept. In
particular, it formalizes the idea of level and clarifies the discussion and the questions
proposed in Section 1.4.4.

The first question that comes to mind is: is it possible to do anything interesting
with only these two constructors? The answer will come in a later section: the
structures constructed in this way are as expressive as any other computational
structure.

Then one might ask why we should be interested in constructing a theory with
two constructors, when we already have a very good one, set theory, which is very
successful in doing everything with only the set-former. In fact, using the well known
Wiener and Kuratowski definition of ordered pair [57], or any other equivalent one,
it is possible to recursively define the concept of n-ple or string.

There are many reasons why we introduce this formalism. First, the theory based
on these two concepts has a particularly appealing intuition common to different
disciplines (see Section 2.3.1). Second, the recursive definition of n-ple based on the
Wiener-Kuratowski definition of ordered pair has the computational flaw that there
are $n$ calls to the first element, $n - 1$ calls to the second, ... Since our hypothetical
applicative need will mainly be that of recursively determining the way a particular
symbol has been constructed, when we are dealing with a string the process is
not parallelizable or, from another point of view, is computationally costly. Third,
as we will see, the notion of string is fundamental for the further developments
of this work in the following chapters. Finally, two is better than one: we are
in the situation where we choose a binary over a unary representation⁴. See also
Example 3.

2.2.1      Codes  and  Classifications 

From now on $\Omega = \{\omega_1, \omega_2\}$, where $\omega_1$ is the constructor of string formation and $\omega_2$ is
the constructor of set formation. We adopt the notational convention $\omega_1(s_1, \dots, s_n) =
[s_1, \dots, s_n]$, or simply $s_1 \dots s_n$; $\omega_2(s_1, \dots, s_n) = \{s_1, \dots, s_n\}$. Note that the two
constructors take only symbols as arguments, so sets are always flat.

Given a set of symbols $\Sigma$, $\mathcal{P}(\Sigma)$ denotes the powerset of $\Sigma$, and $\Sigma^*$ the free semigroup
over $\Sigma$. It should be noted that, if we carry out the construction of $\mathcal{L}(\Sigma)$ under
any semantics $\phi$, by property $\phi 1$, the string $[s]$ composed of the only element $s$ is
identified with the singleton set $\{s\}$.

At this point we need the assumption that the semantics of the universe is
faithful for all the schemes that we will consider. Our main interest in schemes is
connected to their action on texts (see the following Definitions 8, 9, and 10). We will
see in the following that the restriction to faithful semantics does not introduce any
loss of generality.

Remark 2 When the semantics is faithful, we can have a clear picture of what $\mathcal{L}(\Sigma)$
looks like: it contains isomorphic images of $\mathcal{P}(\Sigma)$, $\Sigma^*$, and $\Sigma$ (denoted by abuse of
notation with the same symbols). Moreover $\mathcal{L}(\Sigma) = \mathcal{P}(\Sigma) \cup \Sigma^*$ and $\mathcal{P}(\Sigma) \cap \Sigma^* = \Sigma$.

□


⁴ We should also mention that the designers of SETL [75], the programming language for set
theory, have chosen to allow two different elementary data structures, the string and the set.

It follows from Remark 2 that a scheme can always be conceived of as the union
of two different parts: a code and a classification.

Definition 6 A code is a scheme $S$ such that its support set is a class of strings,
i.e. $\psi(S) \subseteq \Lambda(S)^*$

□

A code $S$ is a formal language over $\Lambda(S)$ in the common sense, where each word
$w$ is assigned the name $\psi^{-1}(w)$.

Example 11 Consider the semantic universe associated to the natural language grammar as defined in Example 6.

The scheme (S, Z) is a code. The reader can try as an exercise to find a scheme which is not a code.

□

Example 12 A more informal but interesting example of a code: the set of all procedures with no argument of a compiled Pascal program.

□

Example 13 The Morse code is a code $\mathcal{M} = (A, M)$ where $A$ is the English alphabet and $M = \{\bullet, -, \text{space}\}$. It is assumed that the encoding ends with a space.

□


In this theory, the counterpart of the code is the classification.

Definition 7 A classification is a scheme $C$ such that its support set is a class of sets, i.e. $\psi(C) \subseteq \mathcal{P}(\mathcal{A}(C))$.

□

Remark 3 A classification $C$ can be [...] the sets in $\psi(C)$ are pairwise disjoint then $C$ is [...] is the set of the classes of an equivalence relation on $\mathcal{A}(C)$.

□

Example 14 Consider the set of syntactic classes $W$ of our natural language example. $W$ is a classification over the entries of the dictionary: to each element $X \in W$ is associated the set of all the entries which have $X$ as attribute.
□

Proposition 1 Let $A$ be a faithful scheme and $A_0 = \mathcal{A}(A)$.

(i) There exist a code $S$ and a classification $C$ such that

1. $\mathcal{A}(S) \subseteq A_0$ and $\mathcal{A}(C) \subseteq A_0$

2. $A = S \cup C$

3. $S \cap C$ is an atomic scheme.

(ii) There exist a unique code $S$ and a unique classification $C$ which do not contain atomic elements and $A = S \cup C \cup U$ where $U$ is the set of the atomic elements of $A$.

Proof: (i) Let $S = \{x \in A \mid \psi(x) \in A^*\}$ and $C = \{x \in A \mid \psi(x) \in \mathcal{P}(A)\}$. 1. and 2. are obvious (see Remark 2). 3. follows from Remark 2 and property $\phi$1.

(ii) Define $S' = S - U$ and $C' = C - U$, with $S$ and $C$ as in (i); these are the code and the classification required. Existence follows from (i) above. To prove uniqueness we just need to observe that $S'$ ($C'$) does not contain atomic elements.
□

Definition  8  A  string  T  of  elements  from  an  alphabet  A  is  called  a  text  over  A. 

a 

The  goal  of  the  following  three  sections  is  to  define  the  action  of  a  scheme  over 
a  text.  We  will  construct  this  definition  first  by  considering  how  a  code  and  a 
classification  can  operate  on  a  text,  and  then  by  studying  the  ways  they  interact 
with  each  other. 
2.2.2 The Action of a Code and a Classification over a Text

[...] how a code and a classification can operate on a text [...] faithful.

Definition 9 Let $S$ be a code.

1. If $T$ is a text over the alphabet of $S$, $\mathcal{A}(S)$, then an encoding $\overline{T}^{S}$ of $T$ under $S$ is a text over $S$ such that if we replace each symbol $s$ in $\overline{T}^{S}$ with its associated codeword $w = \psi(s)$ we get $T$. $S$ is a uniquely decipherable code for $T$, or $T$ is uniquely decodable by $S$, if there exists one and only one encoding $\overline{T}^{S}$. $S$ is an unambiguous code if it is a uniquely decipherable code for any word of $\psi(S)^*$.

2. If $T$ is a text over $S$ then the spelling $\underline{T}_{S}$ of $T$ under the code $S$ is a text over the alphabet of $S$, $\mathcal{A}(S)$, such that $T$ is an encoding for it.

□


Example 15 Let $\mathcal{M}$ be the Morse code as in Example 13. Let

T = -- --- •-• ••• •

then $\overline{T}^{\mathcal{M}}$ = MORSE

□
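Definition 9 and Example 15 can be illustrated with a small sketch. It is not part of the original text: the dictionary representation of a code, the function names `encodings` and `spelling`, and the ASCII rendering of Morse (`.` for the dot, `-` for the dash, a blank closing each codeword) are all assumptions made for illustration. The search simply tries every codeword that is a prefix of the remaining text, so it returns all encodings, and a text is uniquely decodable when exactly one comes back.

```python
from typing import Dict, List


def encodings(text: str, code: Dict[str, str]) -> List[List[str]]:
    """Return every encoding of `text` under `code`.

    `code` maps each symbol s of the code to its codeword psi(s).
    An encoding is a list of symbols whose codewords concatenate to `text`.
    """
    if text == "":
        return [[]]
    found = []
    for symbol, word in code.items():
        # Try each codeword that matches the beginning of the text.
        if word and text.startswith(word):
            for rest in encodings(text[len(word):], code):
                found.append([symbol] + rest)
    return found


def spelling(symbols: List[str], code: Dict[str, str]) -> str:
    """Spell a text over the code back into a text over its alphabet."""
    return "".join(code[s] for s in symbols)


# A fragment of the Morse code; every codeword ends with a space.
morse = {"M": "-- ", "O": "--- ", "R": ".-. ", "S": "... ", "E": ". "}
text = "-- --- .-. ... . "

print(encodings(text, morse))  # [['M', 'O', 'R', 'S', 'E']]
```

The recursion is exponential in the worst case; it is meant to mirror the definition, not to be an efficient decoder.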

Remark 4 We can generalize, in two different ways, the encoding of a text $T$ by a code $S$ even if $T$ is constructed over an alphabet larger than the alphabet of $S$.

1. We assume that all the unknown symbols of $T$, i.e. the ones which are not in the alphabet of $S$, are deleted beforehand.

2. We assume that the unknown symbols of $T$ are left unchanged through the operation.

The second case requires a little more attention regarding the convention to take when an unknown symbol appears within a codeword.

Analogous assumptions can be made with spelling and the other operations introduced in the following.

□

Definition 10 Let $C$ be a classification.

1. If $T$ is a text over the alphabet of $C$ then a generalization $\overline{T}^{C}$ of $T$ under the classification $C$ is a text over $C$ such that there exists a way to replace each symbol $c$ of $\overline{T}^{C}$ with an element in its associated set $\psi(c)$ so as to get $T$.

2. If $T$ is a text over $C$ then a specialization $\underline{T}_{C}$ of $T$ under the classification $C$ is a text over the alphabet of $C$ such that $T$ is a generalization for it.

□

Remark 5 If $C$ is an equivalence relation over $\mathcal{A}(C)$ then for any text $T$ over $\mathcal{A}(C)$ there exists one and only one possible generalization under $C$.

□

Encoding and generalization are multivalued unary operations applicable on texts. We will sometimes abuse the operational notation and identify their result with only one element in its set of values. Encoding can be undefined if the code does not parse the text. When we use encoding as an operation we will tacitly assume that that is not the case. See also Remark 4.

If $X$ is a code (or a classification) the notation $\overline{T}^{X}$ will be read "$T$ represented in $X$". In general the superscript tells what the text is represented in.

Example 16 Consider the classification $W$ of the word classes over the dictionary, constructed as in Example 14. Let $T$ be the sentence "John likes blue cheese". Then $T$ represented in $W$ is $\overline{T}^{W}$ = N TV ADJ N.

□
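The generalization of Definition 10 can be sketched as follows. This is only an illustration under assumptions: a classification is represented as a dictionary from class names to sets of members, and the tiny dictionary `W` below is hypothetical, chosen to reproduce the sentence of Example 16. The function returns every generalization, so a single result corresponds to the equivalence-relation case of Remark 5.

```python
from itertools import product
from typing import Dict, List, Set


def generalizations(text: List[str], classification: Dict[str, Set[str]]) -> List[List[str]]:
    """Return all generalizations of `text` under `classification`.

    Each symbol of the text is replaced by the name of a class containing it.
    If some symbol belongs to no class, no generalization exists.
    """
    options = []
    for symbol in text:
        names = [name for name, members in classification.items() if symbol in members]
        options.append(names)
    return [list(choice) for choice in product(*options)]


# Hypothetical fragment of the word classes W (N: noun, TV: transitive verb, ADJ: adjective).
W = {
    "N": {"John", "cheese"},
    "TV": {"likes"},
    "ADJ": {"blue"},
}

sentence = ["John", "likes", "blue", "cheese"]
print(generalizations(sentence, W))  # [['N', 'TV', 'ADJ', 'N']]
```

When the classes overlap, the same call returns several texts, which is the multivalued behaviour mentioned in the text.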



Definition 11 A scheme $A = S \cup C$, where $S$ is a code and $C$ a classification, is unambiguous if $S$ is an unambiguous code and $C$ is an equivalence relation over $\mathcal{A}(A)$.

□

2.2.3      The  interaction  of  Codes  and  Classifications 

We have just survived a tedious list of definitions and constructions aimed at arriving at the following key points:

•  An interpretational scheme is composed of two disjoint* parts: a classification and a code.

•  Classifications and codes independently provide a representation for a given text.

The goal of the present section is to construct in an analogous way the representation of a text given by a scheme. To do this we will look for the conditions under which the code part and the classification part of a scheme interact nicely to give a unique outcome. The findings in this section are summarized in Theorem 1: a scheme provides a unique representation when it is unambiguous and complete.

As a first step towards that goal we need to observe that a text is simply a string, so we can immediately broaden the application of the generalization operation to codes. This also allows the definition of the encoding operation on classifications, as is done in the following

Definition 12 Let $A = S \cup C$ be a scheme, where $S$ is a code and $C$ is an equivalence relation over $\mathcal{A}(A)$.

a) Let $\overline{S}^{C} = \{\overline{\psi(s)}^{C} : s \in S\}$. Note that $\overline{S}^{C}$ is a code over $C$.

b) Denote with $\overline{C}^{S}$ the equivalence relation over $S$ defined as

$$s \; \overline{C}^{S} \; t \iff \overline{\psi(s)}^{C} = \overline{\psi(t)}^{C}$$

□

*In the sense of Proposition 1.

Definition 13 Let $A = S \cup C$ be a scheme. The completion ${}^{C}S$ of $S$ by $C$ is the code with support set

[...]

$S$ is complete with respect to $C$ if ${}^{C}S = S$. The completion of $A$ is the scheme ${}^{C}S \cup C$. $A$ is complete if it coincides with its completion.

a 

Proposition 2 Let $A = S \cup C$ be a complete scheme ($C$ an equivalence relation and $S$ a code over the alphabet of $A$, $\mathcal{A}(A)$) and $T$ a text over $\mathcal{A}(A)$. If $T$ is uniquely decodable by $S$ then $\overline{T}^{C}$ is uniquely decodable by $\overline{S}^{C}$.

Proof: Let $T = a_1 \ldots a_n$ be a text over $\mathcal{A}(A)$, $\overline{T}^{C} = c_1 \ldots c_n$, and $\sigma = (c_1 : a_1) \ldots (c_n : a_n)$ the substitution such that $\sigma(\overline{T}^{C}) = T$. By contradiction, let $s'_1 \ldots s'_h$ and $s''_1 \ldots s''_k$ be two different encodings of $\overline{T}^{C}$ under $\overline{S}^{C}$. Since $S$ is complete with respect to $C$ then $\sigma(\psi(s'_i)), \sigma(\psi(s''_j)) \in S$ for each $1 \le i \le h$, $1 \le j \le k$. Hence $\psi^{-1}(\sigma(\psi(s'_1))) \ldots \psi^{-1}(\sigma(\psi(s'_h)))$ and $\psi^{-1}(\sigma(\psi(s''_1))) \ldots \psi^{-1}(\sigma(\psi(s''_k)))$ are two different encodings for $T$, a contradiction.

□

Corollary 1 Let $A = S \cup C$ be a complete scheme ($C$ an equivalence relation and $S$ a code over $\mathcal{A}(A)$). $S$ is an unambiguous code if and only if $\overline{S}^{C}$ is an unambiguous code.

a 

The results in Proposition 2 and Corollary 1 do not hold if we drop the hypothesis of completeness, as is shown in the following example.

Example 17 Let $A = S \cup C$ where $S = \{ac, a, d\}$ and $C = \{\{a\}, \{c, d\}\}$. $S$ is unambiguous but $\overline{S}^{C} = \{A, AC, C\}$ is not unambiguous. In fact, $S$ is not complete. Its completion is ${}^{C}S = \{ac, a, c, d\}$, which is not unambiguous.

□
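The three claims of Example 17 can be checked mechanically. The sketch below uses the Sardinas-Patterson test, a standard procedure for unique decodability that is not taken from this text; it reads "unambiguous" as "every concatenation of codewords has exactly one parse". Codewords are plain Python strings whose characters stand for symbols, and the function name is an assumption.

```python
def is_uniquely_decodable(codewords):
    """Sardinas-Patterson test on a finite set of non-empty codewords."""
    words = set(codewords)

    def dangling(prefixes, targets):
        # Non-empty suffixes w such that p + w = t for some p in prefixes, t in targets.
        return {t[len(p):] for p in prefixes for t in targets
                if t != p and t.startswith(p)}

    current = dangling(words, words)
    seen = set()
    while current:
        # A dangling suffix that is itself a codeword yields two distinct parses.
        if current & words:
            return False
        seen |= current
        following = dangling(words, current) | dangling(current, words)
        current = following - seen
    return True


print(is_uniquely_decodable({"ac", "a", "d"}))        # S: True
print(is_uniquely_decodable({"A", "AC", "C"}))        # S generalized under C: False
print(is_uniquely_decodable({"ac", "a", "c", "d"}))   # completion of S: False
```

In the second and third calls the ambiguity comes from the same pattern as in the example: a two-symbol codeword that can also be read as two one-symbol codewords.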


Suppose now that we have an interpretational scheme $A = S \cup C$ and a text $T$. We can represent $T$ in $S$ or in $C$. Then we can apply $C$ and $S$ on these two representations. The following theorem states that the results obtained are identical modulo a canonical equivalence.

Definition 14 Two sets of symbols $A$ and $A'$ are equivalent, $A \sim A'$, if there exists a bijective function $\phi : A \to A'$ called equivalence. If $\phi : A \to A'$ is an equivalence and $T^{A} = a_1 \ldots a_n$ a text over $A$ we define $\phi(T^{A}) = \phi(a_1) \ldots \phi(a_n) = T^{A'}$. Confusion is avoided by the superscripts denoting the alphabet the texts are constructed upon. Two texts $T$, $T'$ over the alphabets $A$, $A'$ are equivalent, $T \sim T'$, if $A \sim A'$ with equivalence $\phi$ and $\phi(T) = T'$.

□

Theorem 1 Let $A = S \cup C$ be an unambiguous complete scheme. Then there exists a canonical equivalence between the representation of $\overline{T}^{C}$ by $\overline{S}^{C}$ and the representation of $\overline{T}^{S}$ by $\overline{C}^{S}$ for any text $T$ over the alphabet of $A$.


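Proof: We need to show that there exists a canonical equivalence $\phi : \overline{S}^{C} \to \overline{C}^{S}$ such that

$$\phi\left(\overline{\overline{T}^{C}}^{\,\overline{S}^{C}}\right) = \overline{\overline{T}^{S}}^{\,\overline{C}^{S}}$$

for any text $T \in \psi(S)^*$.

Define $\phi : \overline{S}^{C} \to \overline{C}^{S}$ as

$$\phi(\bar{s}) = c, \qquad \psi(c) = \{t \in S \mid \overline{\psi(t)}^{C} = \psi(\bar{s})\} \qquad (2.3)$$

Let $T \in \psi(S)^*$ and let $\overline{T}^{S} = s_1 \ldots s_m$ be the unique encoding of $T$ by $S$: $T = \psi(s_1) \ldots \psi(s_m)$ and so $\overline{T}^{C} = \overline{\psi(s_1)}^{C} \ldots \overline{\psi(s_m)}^{C}$. By Definition 12 (a) of $\overline{S}^{C}$ we have $\overline{\psi(s_i)}^{C} \in \psi(\overline{S}^{C})$, $1 \le i \le m$. Hence

there exists $\bar{s}_i \in \overline{S}^{C}$ such that $\psi(\bar{s}_i) = \overline{\psi(s_i)}^{C}$, $1 \le i \le m$ (2.4)

So $\bar{s}_1 \ldots \bar{s}_m$ is an encoding of $\overline{T}^{C}$ by $\overline{S}^{C}$. By Proposition 2 it is the only one, so [...]. On the other hand, by Definition 12 (b),

there exists $c_i \in \overline{C}^{S}$ such that $s_i \in \psi(c_i)$, $1 \le i \le m$ (2.5)

hence if we consider again $\overline{T}^{S} = s_1 \ldots s_m$ we have that $c_1 \ldots c_m$ is the representation of $\overline{T}^{S}$ by $\overline{C}^{S}$.

We only need to show that $\phi(\bar{s}_i) = c_i$, $1 \le i \le m$. By (2.3) we have that

$$\phi(\bar{s}_i) = c, \qquad \psi(c) = \{t \in S \mid \overline{\psi(t)}^{C} = \psi(\bar{s}_i)\}$$

(2.4) implies that $\psi(\bar{s}_i) = \overline{\psi(s_i)}^{C}$, hence $s_i \in \psi(c)$. Since $\overline{C}^{S}$ is an equivalence relation, (2.5) implies that $c = c_i$.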

2.2.4     The  Alphabet 

The canonical equivalence in the previous theorem permits us to define the action of an unambiguous complete scheme on a text. In fact we can conventionally change into identical symbols the ones in $\overline{S}^{C}$ and $\overline{C}^{S}$ which are in correspondence through the equivalence $\phi$ constructed above. In that situation Theorem 1 guarantees that the result of applying $\overline{S}^{C}$ on $\overline{T}^{C}$ is the same as that of applying $\overline{C}^{S}$ on $\overline{T}^{S}$. This leads us to improve Definition 4: when we live in a semantic world all the symbols have meaning. The following definition is also an attempt to formalize the notion of alphabet, intended in the linguistic sense of the term.

Definition 15 An alphabet is a triple $(X, \psi, A)$ where $A = S \cup C$ is a complete unambiguous scheme, $X$ is a set of symbols, and $\psi$ a bijective function $\psi : X \leftrightarrow \overline{S}^{C}$ (or equivalently $\psi : X \leftrightarrow \overline{C}^{S}$).

We will abuse the notation and conventionally write $A = S \cup C$ even for alphabets. It is understood that the set of symbols in $A$ is the one of $\overline{S}^{C}$ ($\overline{C}^{S}$). It is also understood that alphabets are unambiguous and complete.

Definition 16 Let $A = S \cup C$ be an alphabet. An interpretation $\overline{T}^{A}$ of a text $T$ under the alphabet $A$ is the text over $A$ obtained by applying $\overline{S}^{C}$ ($\overline{C}^{S}$) on $\overline{T}^{C}$ ($\overline{T}^{S}$). Explanation is the inverse operation of interpretation. Explanation can be multivalued.

□


Example 18 Let $T$ be a text formed by juxtaposition of sentences on $W$ generated by our natural language grammar; take, for instance,

T = TADJNTVADJNPADJNTNTVNPNTVtoVD

Change the grammar by modifying the <PN> and the <TOVO> productions in the following way:

<PN>    ::= P <NSTG1>

<TOVO>  ::= to <LVR> <OBJ1>

and by adding the productions

<NSTG1> ::= <LNR>

<OBJ1>  ::= <NSTG> | <TOVO> | null

Construct the semantic universe, [...], S, and Z as in Example 6. In the following two examples we will abuse the overline superscript notation introduced. Thanks to Theorem 1 it is possible to collapse a large stack of overline superscripts, e.g. [...]

1. $T$ interpreted in S is $\overline{T}^{S}$ = S S S

2. $T$ interpreted in Z is $\overline{T}^{Z}$ = SUBJ VERB OBJ SUBJ VERB SUBJ VERB OBJ

Example 19 The entries of a dictionary of a language form an alphabet on the context of all the roots and suffixes. The words are classified by choosing one representative for all the declensions of a word. The dictionary itself describes the semantics of the entry by explaining some of its syntaxes in the context of all the words of a language.


2.2.5      Hierarchies  of  Alphabets 

Let us now consider our formulation of the learning problem as the construction of a semantics for the reality under observation. We are now able to understand how to take this definition: the goal is to construct a semantic universe which contains the symbols in the text observed. We are now in the situation where there are only two constructors (remember that in the next chapter we will appreciate their strength), so it is easy to devise a way to proceed.

•  Construct, in an appropriate way, an alphabet over the set of symbols present in the text, i.e., construct a classification and a code and, for each set or string $s$ constructed, create a symbol $\bar{s}$ and assign $\psi(\bar{s}) = s$;

•  Interpret the text under the alphabet constructed and proceed both recursively and at the same level by constructing another appropriate alphabet.

The goal of the next chapter is to determine what appropriate should be taken to mean. This section clarifies what good families of alphabets are.

Since the reference is the inverse function of the semantics, if one constructs a well behaved reference on a universe of symbols $U$, that will define a semantics on [...]. A good family of alphabets is one that defines a well behaved reference, i.e. a family $\{A_i\}_{i \in I}$ such that for every $i, j \in I$, $i \ne j$, $s \in A_i \cap A_j$ implies $\psi_i(s) = \psi_j(s)$. In fact, in this situation we can construct a reference: let $\Sigma = \bigcup_{i \in I} A_i$ and $U = \bigcup_{i \in I} \mathcal{A}(A_i)$; define $\psi : \Sigma \to \mathcal{L}(U)$ in such a way that $\psi(s) = \psi_i(s)$ if $s \in A_i$. Then there exists an extension of $U$ and an extension $\phi$ of $\psi$ such that $(U, \phi, \Omega)$ is a semantic universe.

In Chapter 3, however, we focus mainly on hierarchical families of alphabets:

Definition 17 A family $\{A_i\}_{i=0}^{n}$ of alphabets is hierarchical if it is a semantic universe and $\mathcal{A}(A_i) \subseteq A_{i-1}$, $1 \le i \le n$.

□

Hierarchical families have the nice property pointed out in the following

Proposition 3 Let $\{A_i\}_{i=0}^{n}$ be a hierarchical family of alphabets. Then $s \in A_i \cap A_j$ implies $\psi_i(s) = \psi_j(s) = s$, for every $0 \le i \ne j \le n$.
2.3 Structures

Structure is one of the universal notions common to several scientific disciplines. This section is a small digression to relate the structures that we have just introduced with algebraic and computational structures. We will not do the same with structures in logic: that field is so closely related that even a quick glimpse at it would be far too laborious. As far as algebraic structures are concerned, we will be content to note that monoids, sets with an associative operation and a unit, are probably the most studied structures in algebra: other structures are constructed in terms of those. The interesting relation with our construction is that any monoid can be represented as a free monoid (the universal language in computer science jargon) and an equivalence relation on its words [47].

2.3.1 Computational Structures

In Computer Science the structure is represented by the program, or by any other equivalent computational unit like the grammar, the Turing Machine and the recursive function. The question that one might ask is whether alphabets and schemes are equally expressive.

The answer comes with Theorem 2: in the situation where $\Omega$ only contains the operations of string formation and set formation, the interpretational scheme can play the same role as the program in a universal computational scheme.

The computational protocol that we describe uses schemes to perform the operations (OPS) of generalization (GEN), encoding (ENC), specialization (SPC), and spelling (SPL). However we need to modify their definition a little bit and allow OPS to act also on only one symbol, or one substring, of the text. Other computational protocols can be invented using alphabets and their actions in a different way. The one we propose is meant to simulate exactly the behavior of a general phrase structure grammar (of type 0). It is still an open problem whether Theorem 2 is valid in the case when OPS are applied exactly the way they were introduced in Definitions 9 and 10, but we shouldn't be too worried and miss the forest for the trees.

Our simulating device uses two schemes $A$ and $B$. $A$ is used indifferently and nondeterministically to perform (GEN), (ENC), (SPC), and (SPL) on a text $T$. $B$ is used to perform only (SPC) and (SPL). We select a subset $\mathcal{T} \subseteq \mathcal{A}(A) \cup \mathcal{A}(B)$ of terminal symbols. A computation is the performance of the operations (OPS), with the restrictions above, on a given text $T$. The computation is in a final state when the text $T$ contains only terminal symbols.

Lemma 1 Let $G$ be a general phrase structure grammar. Let $G'$ be the grammar obtained from $G$ by deleting all the productions of the kind $s \to t$ where $s$ and $t$ are single symbols (terminal or variable) and by substituting $t$ for $s$ in each production of $G$. Then $L(G) = L(G')$. (See Theorem 4.4 [46])

Theorem 2 Given any general phrase structure grammar $G$ there exists a computational setting as described above which simulates the derivation by $G$ of words in the language $L(G)$.

Proof: By Lemma 1 we can suppose that $G$ contains no productions of the form $s \to t$ where $s$ and $t$ are distinct single symbols. Let all the productions in $G$ be

[...] (2.6)

where $u_i$ and $v_{ij}$ are words on the set of symbols of $G$, $1 \le j \le k_i$, $1 \le i \le n$.

Let $A = \{a_1, \ldots, a_n, b_1, \ldots, b_n\}$ where $\psi(a_i) = u_i$; $\psi(b_i) = \{a_i, [...]\}$ and $B = \{[...]\}$ where $\psi(c_i) = \{d_{i1}, \ldots, d_{ik_i}\}$ and $\psi(d_{ij}) = v_{ij}$. All the new symbols introduced [...] are distinct from the symbols of $G$. Moreover let the terminal symbols of the device be the same as the terminal symbols of $G$.

It is easy to see that for each use of a production rule of $G$ there is a sequence of OPS in the order (ENC GEN SPC SPL) which achieves the same result. Moreover if the device starts on the text composed of the only starting symbol of the grammar, it will only generate words of $L(G)$.
The following example shows that an alphabet used as a computational device can sometimes provide a more compact definition of a language.

Example 20 Consider the context sensitive language $x^n z^n y^n$, $n > 0$. It can be generated by the grammar:

S → xSXμ     S → xzμ     S → 0
μX → Xμ      zX → zz     μ → y

It is just a matter of a little bit of work to see that we need $O(n^2)$ steps to get

$x^n z^n y^n$

Let us go back to our usual mode of application of OPS as in Definitions 9 and 10. The following alphabet

S → 1x1z1y    1x → 1xx

1y → 1yy      1z → 1zz

obtains $1x^n 1z^n 1y^n$ in $O(n)$ applications of SPL. One application of SPL with the following alphabet will yield $x^n z^n y^n$.

1x → x

Note that even if we counted the spelling of each symbol as one operation, still we would need only $O(n)$ of them. Of course the same could be achieved with grammars if we had rules organized in sets, with the requirement that if one rule is applied then all the others in the same set must be applied.
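The linear number of SPL applications in Example 20 can be made concrete with a short simulation. Several things here are assumptions rather than statements of the text: `1x`, `1z`, `1y` are treated as single symbols, unknown symbols are left unchanged by SPL (the second convention of Remark 4), and the final alphabet is completed with the entries `1z → z` and `1y → y`, since only `1x → x` is legible in the source.

```python
def spl(text, alphabet):
    """One application of SPL: every symbol known to the alphabet is
    replaced by its spelling, all at once; unknown symbols are kept."""
    out = []
    for symbol in text:
        out.extend(alphabet.get(symbol, [symbol]))
    return out


# First alphabet of the example: each marker reproduces itself plus one letter.
grow = {
    "S": ["1x", "1z", "1y"],
    "1x": ["1x", "x"],
    "1z": ["1z", "z"],
    "1y": ["1y", "y"],
}

# Second alphabet: the markers are spelled as plain letters (1z and 1y are assumed).
finish = {"1x": ["x"], "1z": ["z"], "1y": ["y"]}


def generate(n):
    """Produce x^n z^n y^n (n >= 1) with n + 1 applications of SPL."""
    text = ["S"]
    for _ in range(n):
        text = spl(text, grow)
    return "".join(spl(text, finish))


print(generate(3))  # xxxzzzyyy
```

Each call to `spl` rewrites the three markers simultaneously, which is why the number of applications grows linearly with n rather than quadratically.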
2.4 Open Problems and Future Work
2.4.1 Multidimensional Codes

We might want to explore whether it is possible to extend these ideas in order to apply them also to the analysis and learning of the visual image. As a first step we will consider only images that have a discrete structure but are not simply texts, i.e., totally ordered sets of symbols. Many mathematical texts can be taken as examples of such images. For example:

[displayed formula: the typeset two-dimensional form of the LaTeX expression given below]

For some reason its expression in LaTeX does not look as expressive, due to its nested parentheses and functional symbols:

\[ \int_{\Sigma} \frac{(a_{1}^{\beta_{1}} + a_{2}^{\beta_{2}})}{A_{22}} d\sigma \]

The following definitions are the first step of an attempt in the direction of treating those texts without having to impose a total order on their symbols.
Definition 18 A texture is a metric space $(T, d)$ where $d$ is a metric. The elements of $T$ are called positions.

□

It follows immediately that a texture is a topological space, with the topology induced by the metric $d$. The topology is defined in the usual way by considering the balls centered in every point as a base for the neighbourhoods.

Hierarchical textures play an important role in the development of our approach. Their definition is based on the notion of ultrametric, which is simply a metric where the triangular inequality is valid in the stronger form $d(x, z) \le \max(d(x, y), d(y, z))$.

Definition 19 A texture $(T, d)$ is hierarchical if there exists an ultrametric $d'$ such that the topology induced by $d'$ on $T$ is the same as the topology induced on $T$ by $d$.

□
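The strong triangular inequality is easy to test on a finite set of positions. The sketch below is an illustration only: the function names are assumptions, and the prefix distance on strings is a standard example of an ultrametric that does not appear in the text. The check covers only the strong inequality, not the remaining metric axioms.

```python
from itertools import product


def satisfies_strong_triangle(points, d):
    """Check d(x, z) <= max(d(x, y), d(y, z)) for all triples of points."""
    return all(d(x, z) <= max(d(x, y), d(y, z))
               for x, y, z in product(points, repeat=3))


def prefix_distance(u, v):
    """Distance 2^(-k), where k is the length of the common prefix; 0 if u == v."""
    if u == v:
        return 0.0
    k = 0
    for a, b in zip(u, v):
        if a != b:
            break
        k += 1
    return 2.0 ** (-k)


words = ["aa", "ab", "ba", "bb"]
print(satisfies_strong_triangle(words, prefix_distance))                 # True
print(satisfies_strong_triangle([0, 1, 2], lambda x, y: abs(x - y)))     # False
```

The second call fails because the usual distance on the line gives d(0, 2) = 2, which exceeds max(d(0, 1), d(1, 2)) = 1; the prefix distance groups words into nested classes, which is the hierarchical behaviour the definition is after.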

Definition 20 A hypertext over an alphabet $A$ is a couple $(T, \phi)$ where $T$ is a connected (in topological terms) texture and $\phi$ a function $\phi : T \to A$.

The definition of hypertext allows a generalization of all the concepts in the previous sections. In fact, we can substitute hypertexts for texts and strings in the previous sections: all the concepts we defined would still make sense and all the facts we proved would still be true. Just a little bit of attention is needed to take care of the metric in the texture of a hypertext when we paste together different pieces.
Chapter 3


Measurement


Most of our definitions of learning have a precise meaning at this point. We now assume that a learning observer $A$ is an interpretational scheme (X, X) where X is all the symbols that $A$ has previously constructed. That $A$ is an interpretational scheme, or equivalently an alphabet, simply means that $A$ will interpret, or encode, $T$ in terms of the symbols that $A$ "knows". Let us do that and consider $\overline{T}^{A}$. Whatever was unknown in $T$ is now lost (see Remark 4) and what is left are only symbols in $A$. So we can consider $T$ as a text over X or, equivalently, $A$.

The goal of the present chapter is to supply our learning observer with a means of measurement for its representation of the reality under observation. The object to be represented is a text $T$. We will assume almost no requirement for the measurement function, which will be called a measure. The nature of the measure, whether it depends on internal parameters or is given by an external agent, will determine if learning is supervised or unsupervised.

In the following chapter we will use the general notion of measure introduced to construct better representations of the text. It is obvious that one can use more than one measure concurrently.
3.1 Efficiency of Encoding

A possible measure can be construed from the following example. Assume that $A$ has an internal representation for a portion of $T$ in a given bounded memory space. The first problem that arises is that of finding a smart way to encode the text $T$ in order to have an internal representation of a larger portion. One way to do that is to construct another interpretational scheme (B, A), and to represent $\overline{T}^{B}$. In this situation the desired measure is the size of the encoding. As we will see in the next section we can quantify the gain obtained in this way.

First, in order to understand what a good encoding is, we need to introduce some notational conventions and to borrow a theorem from Information Theory.

3.1.1      Definitions  and  Notational  Conventions 

Unless otherwise stated $T$ will be a text* over an alphabet $A = \{a_i\}_i$.

1. A measure or cost $|\cdot|$ is a function from the set $B$ of all the texts and classes over an alphabet to the real numbers, such that, for each $a, b \in B$,

a) $\psi(a) \subseteq \psi(b) \Rightarrow |a| \le |b|$

b) $|a \cup b| \le |a| + |b|$

$\psi$ is as in Definition 4. The relation $\subseteq$ in a) is overloaded to mean "substring" or "subset", whichever applies. No relation is assumed between a string and a set. Analogously in b) $\cup$ is overloaded and denotes string concatenation or set union.
An example of measure is the length for texts and the cardinality for classes.

2. $T_N$ is the prefix subtext of $T$ such that $|T_N| = N$. If $N > |T|$ we define [...]

*The default assumption is that texts are finite. Most of the [...] infinite texts given appropriate care and the use of ordinal operations.

3. $\#^{a}(T)$ is the text obtained by deleting all symbols but the symbol $a$ from $T$. Note that if the measure is length, then $|\#^{a}(T)|$ is the number of occurrences of the symbol $a$ in $T$.

4. A probability $P^{T}$ is defined on $A$ through $T$ by means of the formula

$$P^{T}(a) = \lim_{N \to \infty} \frac{|\#^{a}(T_N)|}{|T_N|}$$

For finite texts the expression above does not cause any problem. We assume that infinite texts are nice and this limit always exists.

In general a probability is just a function that is defined in terms of the cost. However, if the measure is length then the induced probability measures the relative frequencies in $T$ of the symbols in $A$.

5. If $C$ is a classification over $A$, we can also consider the conditional probability

$$P^{T}(a \mid c) = \begin{cases} 0 & \text{if } a \notin c \\ \dfrac{P^{T}(a)}{P^{T}(c)} & \text{otherwise} \end{cases}$$

6. An n-ary code is a code over an alphabet with $n$ symbols.

7. Let $S$ be an unambiguous code over $A$ and $\overline{T}^{S}$ a text over $S$. $S$ is optimal for $\overline{T}^{S}$ if the expected cost of one symbol $s \in S$,

$$\langle S \rangle^{T} = \sum_{s \in S} P^{\overline{T}^{S}}(s) \, |\psi(s)|$$

is minimal. Note that properties a) and b) in 1. imply that when $S$ is optimal for a text $T'$ the spelling $\underline{T'}_{S}$ has minimal cost.

8. Let $C$ be a classification on $A$. The expected cost of $C$ is [...]

9. $OPT^{n}(T)$ is the cost of $T^{S}$ where $S$ is an optimal code over an alphabet with $n$ symbols.

10. The compression coefficient $\lambda_T(A)$ of an alphabet $A$ for a text $T$ is

$$\lambda_T(A) = \lim_{N \to \infty} \frac{|OPT^{n}(\overline{T_N}^{A})|}{|OPT^{n}(T_N)|}$$

11. The n-ary information, or entropy, of a text $T$ over an alphabet $A$ is the expression

$$i(T) = -k \sum_{a \in A} P^{T}(a) \log P^{T}(a)$$

where $k = 1/\log n$. All the logarithms are in base $e$.

12. The n-ary information, or entropy, of a class $c$ of elements in $A$ is

[...]

Theorem 3 The expected cost of a symbol in $T$ under an optimal n-ary code is equal to the n-ary information of $T$.
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The  previous  fundameu.al  theorem  of  information  theory  shows  that  the  express.on 
for  in/ormafon  is  actnaUy  a  quantification  of  the  same  commonly  used  concept. 
However,  it  is  often  more  intuitively  rewarding  if  it  is  taken  to  quantify  the  mfor- 
mation  that  can  be  contained  in  a  text. 

We  have  formulated  this  theorem  in  its  generalised  form,  where  the  cost  function 
does  not  have  to  be  the  length.  The  same  proof  as  in  (49),  is  stiU  valid.  mu,a,„ 
mutandis,  in  this  more  general  situation. 
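As a concrete reading of definition 11, the n-ary information can be computed directly from symbol frequencies. The following is only a sketch: the function name `nary_information` is hypothetical, and it assumes that cost is length, so that $p^T$ is the relative frequency of each symbol.

```python
import math
from collections import Counter

def nary_information(text, n):
    """n-ary information i(T) = -k * sum p(a) log p(a), with k = 1/log n.

    `text` is any sequence of hashable symbols; p(a) is taken to be the
    relative frequency of a in `text` (assumption: cost is length).
    """
    counts = Counter(text)
    total = sum(counts.values())
    k = 1.0 / math.log(n)
    return -k * sum((c / total) * math.log(c / total) for c in counts.values())

# Four equiprobable symbols: 2 binary digits, or 1 quaternary digit, per symbol.
sample = "abcd" * 25
bits = nary_information(sample, 2)
quats = nary_information(sample, 4)
```

By Theorem 3 these values are the expected cost per symbol that an optimal binary, respectively quaternary, code would achieve on the sample.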

3.1.2     Block  Coding 

One way to optimize the encoding process is block coding, a well known technique of
Information Theory. The idea is very simple: instead of encoding each symbol by
itself, one encodes blocks of symbols. In other words, before encoding, we choose
a code S over A and consider $T^S$, the interpretation of T by S. Let us denote
with $\langle S \rangle^T$ the expected cost of the encoding of one symbol of S in T. Then the
expected cost of $T_N^S$ is

$$N' = \frac{N}{\langle S \rangle^T} \qquad (3.1)$$



By Theorem 3 we know that the expected cost of an optimal encoding of $T_N^S$ is

$$\mathrm{OPT}^n(T_N^S) = N' \cdot i(T^S) \qquad (3.2)$$

(Let n be the number of symbols of the alphabet of T.)
Putting 3.1 and 3.2 together we have

$$\mathrm{OPT}^n(T_N^S) = N\,\frac{i(T^S)}{\langle S \rangle^T} \qquad (3.3)$$

Since the cost of $T_N$ is N, we have proved the following

Proposition 4 Let S and T be a code and a text over an alphabet A respectively.
The compression coefficient of S for T is

$$\lambda_T(S) = \frac{i(T^S)}{\langle S \rangle^T}$$


Example 21 Let us consider the Morse code $\mathcal{M}$ as in Example 13. Let T be
English encoded with the Morse code (in our terminology we should say spelled
with the Morse code). Then

    symbol    —         •         space
    p         0.2875    0.4297    0.2868

We have $i(T) = 1.0785$; $i(T^{\mathcal{M}}) = 2.8902$; $\langle \mathcal{M} \rangle^T = 3.5361$. Finally

$$\lambda_T(\mathcal{M}) = 0.8173$$

□
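The figures of Example 21 can be checked numerically. The sketch below makes two assumptions that are not stated in the text: the three probabilities are taken in the order dash, dot, space, and the quoted information values are read as natural-logarithm values (k = 1), which is what the numbers suggest. The helper names are hypothetical.

```python
import math

def information_nats(probs):
    """Entropy -sum p log p in natural-log units, after renormalising probs."""
    total = sum(probs)
    return -sum((p / total) * math.log(p / total) for p in probs if p > 0)

def compression_coefficient(info_of_interpretation, expected_codeword_cost):
    """Proposition 4: lambda_T(S) = i(T^S) / <S>^T."""
    return info_of_interpretation / expected_codeword_cost

# Values quoted in Example 21 (symbol order is an assumption).
morse_symbol_probs = [0.2875, 0.4297, 0.2868]
i_T = information_nats(morse_symbol_probs)      # close to the quoted 1.0785
lam = compression_coefficient(2.8902, 3.5361)   # close to the quoted 0.8173
```

The small difference between `i_T` and the quoted 1.0785 comes from rounding in the tabulated probabilities.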
As we have seen in the introduction, the expression "there is structure in a text"
means that the text can be compressed. The following example is meant to show
that a text is structured whenever there is a string of symbols, of any length, which
never appears in the text.

Example 22 Let T be a text over an alphabet A, $|A| = n$, and S an unambiguous
code over A, $|S| = m$. Let us suppose that no information is available about the
probabilities $p^T(A)$ and $p^T(S)$ but that $\langle S \rangle^T = l$. Since we have no information
about the probability, we take it to be the uniform distribution. In this situation
$\lambda_T(S)$ is at most $\frac{\log m}{l}$ since the information is maximal when p is the uniform
distribution [49]. By the same token, $\lambda_T(A) = \log n$. So whenever $n^l > m$ it is
convenient to use S. If all the strings of S have the same length l, then the text is
compressible when not all the strings of length l are represented in the text.
□
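The comparison made in Example 22 can be written out as a small check. This is a sketch with a hypothetical function name; the decision is taken on the integers $n^l$ and m to avoid rounding issues, and the two logarithmic bounds are returned only for inspection.

```python
import math

def block_code_pays_off(n, m, l):
    """Uniform-distribution bounds of Example 22.

    n: size of the alphabet A; m: number of codewords in S;
    l: expected codeword cost <S>^T (here an integer block length).
    Returns (bound on lambda_T(S), lambda_T(A), whether S is convenient).
    """
    lam_S = math.log(m) / l   # upper bound on lambda_T(S)
    lam_A = math.log(n)       # lambda_T(A) under the uniform distribution
    return lam_S, lam_A, n ** l > m

# Binary alphabet, blocks of length 3, only 5 of the 8 possible blocks occur.
lam_S, lam_A, convenient = block_code_pays_off(n=2, m=5, l=3)
# With all 8 blocks present there is nothing to gain.
_, _, convenient_full = block_code_pays_off(n=2, m=8, l=3)
```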

Example 23 The Discovery of Phrase Structure

The coefficient $\lambda$ can be applied in many different ways to the study of
language. In this section we will use it to evaluate the performance of algorithm N in
[10] given in Figure 3.1.

We considered a text consisting of 100 words randomly generated using the
natural language grammar given in Section 1.4.4. If we run twice the algorithm
N in [10] on such a set of sentences (which corresponds to finding two hierarchical
levels of codes) we get the code

    S = { (to V), (P T N), (T ADJ N), (P N),
          (P ADJ N), (to D V), (ADJ N), (D),
          (N), (T N), (P T ADJ N), (T V) }

In the following table, we can compare the $\lambda$ value of the alphabet A of terminal
symbols, S, and S2, consisting of all the digrams appearing in the text and all the
word-ending monograms (to guarantee parsing of sentences with an odd number of
components).

              A         S        S2
    λ         2.2701    1.676    2.183
Algorithm N

N = ∅, T = ∅

Forall sentences w

    Consider the terminal symbol of w, say t;

    a) if t ∈ T then factorize² w in strings ending with (and not containing)
       elements of T, call the set of these strings V and set N = N ∪ V.

    b) if t ∉ T then factorize the elements of N in strings ending with and
       not containing t; further, set T = T ∪ {t}, factorize w in strings
       ending with (and not containing) elements of T and add them to N.

end forall

Figure 3.1: Algorithm N

As we see there is an appreciable difference in the gain obtained using Algorithm
N. Moreover, if we notice that the grammar chosen is one which describes (even if
incompletely) the word-class phrase structure of English, we find that the
codewords discovered by Algorithm N coincide with the ones which a person (not only
a linguist) would normally consider to be the elementary strings [42] of the phrase
structure.

This result supports the conjecture that guided our work on the subject: $\lambda$
reaches a local minimum over codes which could be and have been discovered by
means of other "reasonable" considerations. The process which guides language use
is the same at different levels and linguistic ability is the superimposition of the
same operations on different substrata.


² For example, abcaadbcd factorized by T = {c, d} yields V = {abc, aad, bc, d}
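The factorization step used by Algorithm N can be sketched as follows. The function name `factorize` is hypothetical, and the handling of a trailing piece that contains no terminal element is an assumption, since the text does not cover that case.

```python
def factorize(sentence, terminals):
    """Split `sentence` into strings that end with an element of `terminals`
    and contain no other terminal element.

    A trailing piece with no terminal is kept as is (assumption).
    """
    pieces, current = [], ""
    for symbol in sentence:
        current += symbol
        if symbol in terminals:
            pieces.append(current)
            current = ""
    if current:
        pieces.append(current)
    return pieces

# The example given in the footnote.
pieces = factorize("abcaadbcd", {"c", "d"})
```

On the footnote's input this yields the four strings abc, aad, bc and d.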
3.1.3      Class  Coding

While there is no compression gain for a text T over an alphabet A when using only
a classification C over A, there can be a gain when the classification is associated
to a code. Let B = S ∪ C be an alphabet over A. Let $T' = T^C$ and $S' = S^C$. It is easy
to see that $\langle S \rangle^T = \langle S' \rangle^{T'}$, so the expected cost of $T_N^B$ is as in 3.3.

Once again, by Theorem 3, the lower bound for the expected cost of an optimal
encoding of $T_N^B$ is

$$N' \cdot i(T^B) = N' \cdot i(T'^{S'}) \qquad (3.4)$$



Note that in general $i(T^B) < i(T^S)$. However some information is lost: $i(T^B)$
is not sufficient to reconstruct T given $T^B$ because we are missing the information
regarding which element of each class is to be considered. Remember, in fact, that there
can be more than one specialization under a classification. We will now quantify
the extra information needed.

Theorem 1 shows that there is an equivalence between $C'$ and $S'$, so we can
indifferently consider either of the two. Let $s \in S'$, $\psi(s) = c \in C$. The additional
information needed to determine which symbol c stands for is given by $i(c)$. Its expected
value is

$$\langle C \rangle^T = \sum_{c \in C} p^T(c)\, i(c) \qquad (3.5)$$

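Then we have to multiply that value by $\langle S \rangle^T$, the expected cost of the codeword
$\psi(s)$. We finally get that the information per symbol of $T^B$ is

$$i(T^B) + \langle S \rangle^T \langle C \rangle^T \qquad (3.6)$$

We only need to multiply it by N' to get the expected cost of an optimal encoding
of $T_N^B$:

$$\mathrm{OPT}^n(T_N^B) = N' \left( i(T^B) + \langle S \rangle^T \langle C \rangle^T \right) \qquad (3.7)$$

Now, using 3.1 we get

$$\mathrm{OPT}^n(T_N^B) = N \left( \frac{i(T^B)}{\langle S \rangle^T} + \langle C \rangle^T \right) \qquad (3.8)$$
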


We  have  proved  the  following: 


Theorem 4 Let B = S ∪ C and T be an alphabet and a text over an alphabet A.
The compression coefficient of B for T is

$$\lambda_T(B) = \frac{i(T^B)}{\langle S \rangle^T} + \langle C \rangle^T$$

The reader can get an intuition about how the classification allows a further gain
in compression by considering that it causes a decrease in the number of symbols of
the alphabet that is being constructed. When the number of symbols decreases, the
entropy function "usually" decreases. Note that Proposition 4 is a particular case
of Theorem 4 when the classification is the trivial one where each class contains
only one symbol.

Example 24 Let us consider the Morse code as in Example 21 with a slight
variation. Let us suppose that the source and the receptor of the Morse signal are
not well tuned, so that the receptor sees the signals —, +, •, o, space,
even if the source is transmitting only —, •, space. When the source transmits a —
the receptor gets either a — or a + with the same probability. Analogously, when the
source transmits a • the receptor gets either a • or a o with the same probability. The
space is received correctly.

Let us consider an English text T and let $\tilde{T}$ be a possible way T looks to the
receptor. $\tilde{T}$ is a text over the alphabet $\mathcal{M}$ = {—, +, •, o, space}. If the English text
T is long enough, we have

    symbol    —         +         •         o         space
    p         0.1431    0.1431    0.2140    0.2140    0.2857

We have

$$i(\tilde{T}) = \lambda_{\tilde{T}}(\mathcal{M}) = 1.5743 \qquad (3.9)$$

It is obvious that if C is the classification {{—, +}, {•, o}, {space}} then $\tilde{T}^C$ is
the same as the Morse spelling of T. Consider the completion of $\mathcal{M}$ by C and call it
$\mathcal{M}'$. There are 276 elements in $\mathcal{M}'$.

$i(\tilde{T}^{\mathcal{M}'}) = 4.6481$ and the compression coefficient $\lambda_{\tilde{T}}(\mathcal{M}') = 1.3247$. Note that it
is convenient to use $\mathcal{M}'$ because $\lambda_{\tilde{T}}(\mathcal{M}') < \lambda_{\tilde{T}}(\mathcal{M})$ in expression 3.9.

Now, the alphabet $E = C \cup \mathcal{M}'$ acts on $\tilde{T}$ as the English alphabet, so $\tilde{T}^E = T$
(T is the English text we started with). We can compute $\langle C \rangle^{\tilde{T}}$ with the formula
3.5. We get

$$\langle C \rangle^{\tilde{T}} = 0.4944$$

On the other hand, from Example 21 we recall

$$\frac{i(T)}{\langle \mathcal{M} \rangle} = 0.8174$$

(note that $\langle \mathcal{M}' \rangle^{\tilde{T}} = \langle \mathcal{M} \rangle$). It follows from Theorem 4 that $\lambda_{\tilde{T}}(E) = 1.3118$. The use
of the classification C allows a further gain over the use of the code $\mathcal{M}'$.
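The class term of Theorem 4 can be recomputed from the probabilities of Example 24. The sketch below assumes natural logarithms with k = 1 (which is what the quoted figures suggest) and uses hypothetical helper names; the ASCII labels "-", ".", "+", "o" stand for the four received signals.

```python
import math

def class_information(cond_probs):
    """i(c) = -sum p(a|c) log p(a|c), natural logs (k = 1 assumed)."""
    return -sum(p * math.log(p) for p in cond_probs if p > 0)

def expected_class_cost(symbol_probs, classes):
    """Formula 3.5: <C> = sum over classes c of p(c) * i(c)."""
    total = 0.0
    for members in classes:
        p_c = sum(symbol_probs[a] for a in members)
        total += p_c * class_information([symbol_probs[a] / p_c for a in members])
    return total

# Received-signal probabilities quoted in Example 24.
probs = {"-": 0.1431, "+": 0.1431, ".": 0.2140, "o": 0.2140, "space": 0.2857}
classes = [("-", "+"), (".", "o"), ("space",)]

cost_C = expected_class_cost(probs, classes)   # close to the quoted 0.4944
lam_E = 0.8174 + 0.4944                        # Theorem 4 with the quoted values
```

Each two-element class contributes log 2 weighted by its probability, while the singleton class {space} contributes nothing, which is why the trivial classification reduces Theorem 4 to Proposition 4.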

3.2      The  Objective  Viewpoint 

Until now we have embraced a subjective viewpoint: no assumptions were made on
the text under observation. In this section we will take a different standpoint and
assume that the text is meant for communication ends, as in the case where T is an
utterance in a natural language. In this new situation it is assumed that the text is
constructed in such a way as to exhibit properties of optimality and that our learning
system attempts to adjust to that external optimal shape.

In a less general setting, we will be able to see that constructing an
interpretational scheme can be viewed as an evolution toward equilibrium of a physical
system. In this way we will find an intuition that the expressions that we have
found are related in many ways to the expressions of physical concepts such as energy,
entropy, temperature.

3.2.1      The  Boltzmann-Gibbs  Distribution

The interpretational scheme B that we will consider now is simply a code. We
choose to do this because the general case, when B = S ∪ C and C is a nontrivial
classification, requires more care and will result in a less fluent exposition.

Let us consider the set $\mathcal{S}$ of all the unambiguous codes over an alphabet A. A
text T over A can be thought of as a partial function

$$T : \mathcal{S} \to \mathcal{V}, \qquad S = \{s_k\} \mapsto q = \{q_k\}$$

where $q_k$ is the frequency of the symbol $s_k$ in the text. $T(S)$ can be undefined if S
does not parse T. See also Remark 4. It is easy to observe that the condition of
unique decipherability of S implies that T is injective. It should be noted that this
condition could be weakened by making assumptions on the parsing procedure for
the codewords, instead of the nature of the code. If the conventions for parsing are
"good", for instance if they favor "the longer the better" principle, then T will still be
injective.

In the previous section we have found a way to evaluate the choice of the code
$S \in \mathcal{S}$. In the present situation we can reason in a different way, since we have assumed
that the text T is meant for communication purposes.

We invoke the formula

$$|s_k| = -\frac{\log p_k}{\beta} \qquad (3.10)$$

where $\beta = \log n$. This formula relates the cost $|s_k|$ of transmitting signal $s_k$
in some optimal n-ary code and its probability of occurrence $p_k$. For the sake of
our reasoning we need to mention that besides its information theoretical
derivation based on the optimality condition (see Theorem 3)³, the same formula has
been mathematically established by [61] in other ways following diachronic and
synchronic considerations on word formation.

For all symbols $s_k$, we relate the actual cost $l_k$ of the string $\psi(s_k)$
to its assigned cost $|s_k|$ by means of the relation

$$|s_k| = l_k + L \qquad (3.11)$$

L is needed as a normalizing factor as we will see in a moment.

If cost is simply length, as was suggested in [61], the expression above is justified
by the consideration that cost is best interpreted as the time $l_k$ necessary to "read"
(in a generalized sense) the string plus the time L for recognizing its end which in
our case, as opposed to the one in [61], is not marked by any special symbol.

From 3.10, 3.11, and the condition $\sum_k p_k = 1$ we get

$$p_k = \frac{e^{-\beta l_k}}{Z} \qquad (3.12)$$

where $Z = e^{\beta L} = \sum_k e^{-\beta l_k}$. 3.12 is the Boltzmann-Gibbs distribution where the
length of a codeword stands for its energy. The derivation above should convince
us that we can expect the codewords to be distributed according to the Boltzmann-
Gibbs distribution if we consider that once they are established they can be used
as an alternative alphabet (as an example of real language consider Chinese) and
that the choice in their use is not given to the language user.
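Relations 3.10 to 3.12 can be sketched in a few lines. The function name `gibbs_distribution` and the sample codeword lengths are hypothetical; cost is assumed to be length.

```python
import math

def gibbs_distribution(lengths, n):
    """Return (p, L) with p_k = exp(-beta*l_k)/Z and Z = exp(beta*L),
    where beta = log n, following 3.10-3.12."""
    beta = math.log(n)
    weights = [math.exp(-beta * l) for l in lengths]
    Z = sum(weights)
    p = [w / Z for w in weights]
    L = math.log(Z) / beta
    return p, L

# Hypothetical binary code with codeword lengths 1, 2, 3, 3.
lengths = [1, 2, 3, 3]
p, L = gibbs_distribution(lengths, n=2)
# Assigned costs |s_k| recovered from 3.10; they should equal l_k + L (3.11).
assigned_costs = [-math.log(pk) / math.log(2) for pk in p]
```

For these lengths the weights already sum to one, so Z = 1 and the normalizing term L vanishes; for other length profiles L absorbs the difference between actual and assigned cost.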

We can now define the map

$$G : \mathcal{S} \to \mathcal{V}$$

where $G(S) = \{p_k\}$ is the distribution given by 3.12, and we can use Kullback
cross-entropy [53]

$$K(q, p) = \sum_k q_k \log \frac{q_k}{p_k}$$

which measures the "information for discrimination" between the distributions q
and p, to define a measure of how close a code S is to an optimal code.

Our learning mechanism will minimize this distance by minimizing the function
$K(T(S), G(S))$.

³ This formula is not a consequence of Theorem 3 but it is consistent with it in a very particular
way: it constitutes the "best" (as far as basic information theory is concerned) interpretation of
Theorem 3.

The intuitive explanation relies on the fact that Kullback cross-entropy can
be thought of as a directed distance (in fact, besides relative entropy it is also
known as directed divergence) from the distribution q to p. We should mention that
the principle of cross-entropy minimization is a generalization of Jaynes' principle
of maximum entropy [48] which can be applied when a prior distribution p that
estimates the distribution q is given in addition to the constraints. A detailed
study and a bibliography on its applications, which range from statistics to pattern
recognition and spectral analysis, can be found in [79,80].

Example 25 Let us consider again the Morse code $\mathcal{M}$ as in Example 21. If T is
any long enough English text the vector $T(\mathcal{M})$ is

         a       b       c       d       e       f       g       h
    p    0.0778  0.0128  0.0298  0.0417  0.1070  0.0251  0.0179  0.0576

         i       j       k       l       m       n       o       p
    p    0.0750  0.0059  0.0094  0.0384  0.0290  0.0715  0.0718  0.0179

         q       r       s       t       u       v       w       x
    p    0.0053  0.0561  0.0724  0.0820  0.0317  0.0169  0.0202  0.0049

         y       z
    p    0.0196  0.0023


The vector $G(\mathcal{M})$ is

         a       b       c       d       e       f       g       h
    p    0.0679  0.0146  0.0101  0.0315  0.2112  0.0146  0.0218  0.0210

         i       j       k       l       m       n       o       p
    p    0.0979  0.0070  0.0218  0.0146  0.0471  0.0679  0.0152  0.0101

         q       r       s       t       u       v       w       x
    p    0.0070  0.0315  0.0454  0.1465  0.0315  0.0146  0.0218  0.0101

         y       z
    p    0.0070  0.0101

The cross entropy between the two is $K(T(\mathcal{M}), G(\mathcal{M})) = 0.1993$.

In Figure 3.2 we can compare the values of $T(\mathcal{M})$ against the ones of $G(\mathcal{M})$.

Figure 3.2: $T(\mathcal{M})$: solid line; $G(\mathcal{M})$: dotted line.

□
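The cross-entropy quoted in Example 25 can be recomputed from the two tabulated vectors. This is a sketch with a hypothetical function name; it assumes natural logarithms and lists the 26 values in alphabetical order, a to z.

```python
import math

def cross_entropy(q, p):
    """Kullback cross-entropy K(q, p) = sum q log(q/p), natural logs."""
    return sum(qk * math.log(qk / pk) for qk, pk in zip(q, p) if qk > 0)

# Observed letter frequencies (first table of Example 25), a..z.
T_M = [0.0778, 0.0128, 0.0298, 0.0417, 0.1070, 0.0251, 0.0179, 0.0576,
       0.0750, 0.0059, 0.0094, 0.0384, 0.0290, 0.0715, 0.0718, 0.0179,
       0.0053, 0.0561, 0.0724, 0.0820, 0.0317, 0.0169, 0.0202, 0.0049,
       0.0196, 0.0023]
# Boltzmann-Gibbs values (second table of Example 25), a..z.
G_M = [0.0679, 0.0146, 0.0101, 0.0315, 0.2112, 0.0146, 0.0218, 0.0210,
       0.0979, 0.0070, 0.0218, 0.0146, 0.0471, 0.0679, 0.0152, 0.0101,
       0.0070, 0.0315, 0.0454, 0.1465, 0.0315, 0.0146, 0.0218, 0.0101,
       0.0070, 0.0101]

K = cross_entropy(T_M, G_M)   # close to the quoted 0.1993
```

The largest positive contributions come from letters such as o, h and l, which are frequent in English but have long Morse codewords, and the largest negative ones from e and t.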
3.2.2      Information  Mechanics

"Computing processes are ultimately abstractions of physical processes:
thus a comprehensive theory of computation must reflect in a stylized
way aspects of the underlying physical world. On the other hand, physics
itself may draw fresh insights and productive methodological tools from
looking at the world as an ongoing computation. The term information
mechanics seems appropriate for this unified approach to physics and
computation." [82]

Let us try to consider the function that we want to minimize in a different light.
If we multiply $K(T(S), G(S))$ by the constant $T = 1/\beta$ we can write it in the form

$$T\,K(T(S), G(S)) = T \sum_k q_k \log q_k + \sum_k q_k |s_k| = U - TS$$

If we interpret cost as energy, we can see that there is a consistency between the two
concepts at the microscopic and the macroscopic level and that our function takes
in fact the familiar form of free energy (we have overloaded the symbol S which
on the right side of the equality stands for entropy density). This outcome is not
surprising and it depends on the way we have constructed our reasoning. However in
the present situation we know what "temperature" stands for, and we can attempt
an interpretation at the microscopic level of the process of minimization as a process
of thermalization.
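The free-energy form can be derived step by step from 3.10 and the definition of K. The following is a sketch of that derivation; the names U for the mean assigned cost and S for the entropy of q are assumptions about the notation.

```latex
\begin{align*}
K(q,p) &= \sum_k q_k \log q_k - \sum_k q_k \log p_k
        && \text{(definition of } K\text{)} \\
       &= \sum_k q_k \log q_k + \beta \sum_k q_k |s_k|
        && \text{(by 3.10, } \log p_k = -\beta |s_k|\text{)} \\
T\,K(q,p) &= T \sum_k q_k \log q_k + \sum_k q_k |s_k|
        && (T = 1/\beta) \\
       &= U - T S,
        && U = \sum_k q_k |s_k|, \quad S = -\sum_k q_k \log q_k .
\end{align*}
```

Using 3.11 and $Z = e^{\beta L}$, the mean assigned cost can also be written as $U = \sum_k q_k l_k + T \log Z$, so that $T\,K$ is the free energy computed with the actual costs $l_k$ minus its equilibrium value $-T \log Z$; it is nonnegative and vanishes exactly when q coincides with the distribution 3.12.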

If we consider T as a function of the number of symbols n we see that negative
temperatures correspond to a number of symbols smaller than 1, a situation quite
difficult to imagine in both directions (remember that T is the absolute
temperature). The interesting case T > 0 shows that smaller temperatures correspond to a
higher number of symbols. The process of cooling down the system in order to find
the minimum of the cross-entropy function corresponds, at the microscopic level,
to allowing the possibility to use more and more symbols and hence to distinguish
more and more signals. We should notice that even if we let the temperature go to
zero, we can reasonably expect the system to freeze on a configuration with a finite,
relatively small, number of symbols: when the temperature is small it is unlikely to
deviate too much from an already good configuration.

3.3      A  Learning  Rule  for  Neural  Nets

The ideas we have considered so far can find an application to the study of neural
nets. In this section we will propose an architecture which permits a representation
of the subsequent encodings of a text as the transmission of a sensory signal through
a hierarchical tissue. The learning rule that we propose is derived from the findings
in Section 3.2. A generalization in the flavor of Section 3.1 can have an analogous
development.

3.3.1      The  Logical  Neural  Tissue 

The neural architecture described in this section is inspired by the optimizing
principles derived earlier in this chapter and is grounded on the neurological fact that
the reaction time to a sensory input is roughly equal to the time necessary for the
signal to travel through the neural system. We propose the intuition that the
elaboration of the input takes place during the transmission. As a first approximation we
consider an architecture consisting of a hierarchy of layers of logical neurons. Each
neuron transmits its excitation to neurons in the level above and receives excitation
from the level below.

The units at level $L_0$ are the sensory units. We assume the net is a priori able
to  detect  a  set  of  features  of  the  input.  For  each  feature  there  exists  a  population 
of  units  which  is  triggered  by  its  presence  in  the  window. 

Which  one  is  the  sensory  level  is  obviously  conventional.  At  each  level,  in  fact, 
each  unit  is  responding  to  higher  level  features  formed  by  the  concomitant  presence 
of  lower  level  ones.  Hence  at  any  level  Li  the  units  can  be  considered  as  functional 
receptors.  It  is  just  a  matter  of  conventions,  depending  on  what  level  of  features  in 
the  sensory  input  the  units  are  assumed  to  be  able  to  respond  to. 

3.3.2      The  Learning  Rule 

The transition of the signal from one layer to the succeeding one corresponds to one
level of encoding. The learning rule that we will now describe operates unchanged
at all levels in the hierarchy. We describe it in general by explaining how to set the
weights from the units at level $L_i$ to the units at level $L_{i+1}$.

In the following we assume that the threshold of every unit is 1. Let us consider
every unit s as a signal, and assign to it an "energy" $\varepsilon(s)$ equal to the information
theoretical cost of transmitting the signal in an optimal n-ary code

$$\varepsilon(s) = -\frac{\log p(s)}{\beta} \qquad (3.13)$$

where $\beta = \log n$ and $p(s)$ is the probability of firing for unit s computed as
normalized relative frequency with respect to the other units in the same level.

To every unit t in level $L_{i+1}$ is assigned an energy equal to

$$\varepsilon(t) = \sum_{s \in L_i} w_{s,t}\, \varepsilon(s) \qquad (3.14)$$

where $w_{s,t}$ is the weight of the connection of s to t. The intuitive justification of
(3.14) is obvious: every unit is assigned an energy equal to the weighted sum of the
energies of the units connected to it.

We now enforce the intuition that each layer can be considered as the sensory
layer: we want the neurons in layer $L_{i+1}$ to perform as symbols in an optimal
code, so the learning process should set the weights $w_{s,t}$ in such a way that

$$p(t) = \frac{e^{-\beta \varepsilon(t)}}{Z} \qquad (3.15)$$

$\forall t$, where $Z = \sum_t e^{-\beta \varepsilon(t)}$.

This rule is also justified by the following argument. The firing of each unit
at level $L_i$ carries an energy that is transmitted by the units connected to it at
level $L_{i+1}$. We want the unit at level $L_{i+1}$ to behave, energetically, as a system in
equilibrium with that at level $L_i$. This approach is founded on the assumption
that any optimal semiotic system has to satisfy this requirement, as derived in the
previous Section 3.2.
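Conditions 3.13 to 3.15 can be sketched for a single pair of layers. Everything named here (the function names, the weight layout `weights[s][t]`, the squared-error measure and the toy numbers) is an assumption made for illustration; the text does not prescribe a particular error function.

```python
import math

def unit_energies(firing_probs, n):
    """Eq. 3.13: energy of each lower-level unit, eps(s) = -log p(s) / beta."""
    beta = math.log(n)
    return [-math.log(p) / beta for p in firing_probs]

def target_distribution(weights, lower_energies, n):
    """Eqs. 3.14-3.15: eps(t) = sum_s w[s][t]*eps(s); p(t) = exp(-beta*eps(t))/Z."""
    beta = math.log(n)
    n_upper = len(weights[0])
    eps_t = [sum(weights[s][t] * lower_energies[s] for s in range(len(weights)))
             for t in range(n_upper)]
    boltz = [math.exp(-beta * e) for e in eps_t]
    Z = sum(boltz)
    return [b / Z for b in boltz]

def rule_error(observed_upper_probs, weights, lower_probs, n):
    """Squared deviation from condition 3.15 (one possible error function)."""
    target = target_distribution(weights, unit_energies(lower_probs, n), n)
    return sum((o - g) ** 2 for o, g in zip(observed_upper_probs, target))

# Toy layer: two lower units, two upper units, identity weights.
lower = [0.75, 0.25]
w = [[1.0, 0.0], [0.0, 1.0]]
err = rule_error([0.75, 0.25], w, lower, n=2)
```

With identity weights each upper unit simply inherits the energy of one lower unit, so the firing frequencies of the lower layer already satisfy 3.15 and the error is zero up to rounding.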

After the net has been trained on a particular homogeneous reality, it will show
recognition of other instances of the same training set by being able to transmit the
signals from the sensory units to higher levels of the hierarchy. Activations in high
level layers correspond to a high level description of the input. A sample from a
radically different reality will result in little or no transmission of the signal.

3.3.3      Further  Developments 

One way to implement the learning rule of Section 3.3.2 is to perform a simulated
annealing algorithm [50] in order to minimize a suitable error function (for example
$p(t) - e^{-\beta \varepsilon(t)}/Z$). To speed up the process we can make the weight matrix sparse,
that is, each single unit is connected only to a small number n of other units chosen
at random.

There are many assignments to the weight matrix which satisfy condition (3.15).
A priori, all of the assignments are equally adequate and they correspond to different
Weltanschauungen. The possibility of comprising many different ones depends on
the number of elements in the population detecting a particular feature and the
number of connections. This explains the intuition that the higher the number of
units and connections, the more versatile is the net.

In  many  fields  of  application,  like  natural  language,  the  interesting  feature  of  the 
reality  under  consideration  is  just  the  temporal  correlation  of  the  images  present 
on  the  window  at  different  instants  of  time.  In  such  applications  it  is  advisable  to 
provide,  at  every  level  of  the  net,  some  units  with  a  delayed  reaction.  Such  units 
will  provide  a  sort  of  short  term  memory. 

Relying on the global situation for determining each single weight is the major
limitation of our learning rule and we can easily expect the same slow convergence
as in the Boltzmann machine. Moreover our learning rule relies on the availability
of the information about the frequency of firing of every unit. In some way yet
another global mechanism, exterior to the net, has to provide that information.
These limitations nullify any advantage of a neural net implementation over the
algorithmic implementation of the process. Thus it is advisable to focus on the
search for a local learning rule with the same desired global behavior and/or a
distributed mechanism to keep the information regarding the relative frequency of
every neuron updated.

The ideas presented in 3.3.2 can also be exploited for the study of any general
neural net. In fact, given any neural net with fixed weights, if we assign to each
neuron of a net a symbol (that is, we adopt the grandmother neuron viewpoint)
the history of the net can be interpreted as a sequence of encodings of the sensory
input. If we want those encodings to be optimal, then the net has to satisfy condition
(3.15). It would be interesting to check how many of the learning rules present in
the literature fix the weights so as to satisfy condition (3.15).


Chapter  4 
Algorithms 


Once a text has been input and interpreted, we can change the level of perception
and face the same problem again. It is immediately evident that the process of
constructing an interpretational scheme to obtain a larger representation of T can
be iterated. We can think of a hypothetical learning machine which constructs
layers of interpretational schemes.

The object of the present chapter is to find a global picture for this process,
by describing some algorithms that concretize the abstract learning machine that
we have outlined in Section 2.2.5.

4.1      Minimal  Description 

We  have  the  alphabet  as  a  universal  model  for  structures,  and  its  compression 
coefficient  as  a  measure  of  its  effectiveness  on  a  given  text.  We  are  ready  to  describe 
our  learning  algorithms.  As  we  have  promised,  our  goal  is  to  achieve  an  efficient, 
if  not  minimal,  description  of  the  text  to  be  learned.  So  we  need  to  introduce  the 
following 

Definition 21 Let B be an alphabet. The size of B is $|B| = \sum_{s \in B} |\psi(s)|$.
The maximum size of B is $|B|_{\max} = \max_{s \in B} |\psi(s)|$.

Definition 22 Let T be a text over an alphabet A. A description of T is a pair
(B, T') where B is an alphabet over A and T' is a text over B such that $T^B = T'$.

The size of the description (B, T') is

$$|(B, T')| = |B| + |T|\,\lambda_T(B)$$

□

Remark 6 Note that if the text is much bigger than the alphabet, then the only
significant contribution to the size of the description is given by the second term
$|T|\,\lambda_T(B)$.

□
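Definitions 21 and 22 translate into a few lines of code. The sketch below assumes that cost is length, so that $|\psi(s)|$ is the number of characters in a codeword; the function names, the sample alphabet and the coefficient 0.8 are all hypothetical.

```python
def alphabet_size(codewords):
    """Definition 21: |B| = sum of the lengths |psi(s)| of the codewords."""
    return sum(len(w) for w in codewords)

def alphabet_max_size(codewords):
    """Definition 21: |B|_max = length of the longest codeword."""
    return max(len(w) for w in codewords)

def description_size(codewords, text_length, compression_coefficient):
    """Definition 22: |(B, T')| = |B| + |T| * lambda_T(B)."""
    return alphabet_size(codewords) + text_length * compression_coefficient

# Hypothetical alphabet of three codewords and an assumed coefficient of 0.8.
B = ["ab", "abc", "d"]
size = description_size(B, text_length=1000, compression_coefficient=0.8)
```

In this toy case the alphabet contributes 6 and the text term 800, which illustrates Remark 6: for long texts the second term dominates.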


4.2      Enumeration  Algorithms

All the algorithms in this section take a text as input and return the topmost
alphabet that they have constructed. They will represent different stages of evolution
of the trivial enumeration algorithm, the universal program that solves everything.

We want to model learning, evolution in time, but still we want the algorithms
to be algorithms, i.e. engines that consume input and produce output. So we
make the following assumption: the algorithms take a text as input and produce
an alphabet as output on-line right away. Then, they will start working and, little
by little, update their output. The first step in this abstraction will be to assume
that all of the text is input in one single operation. Later on we will reconsider
this assumption and find a way to render the process more feasible.

4.2.1      Universal  Enumeration 

It is not difficult to see that there exists an effective enumeration of all the alphabets
over a given alphabet; the reader can easily construct it. The only
things to be treated with a little bit of care are the tests for completeness and
unambiguousness. Completeness of an alphabet is easily decided by brute force
since it involves only a finite number of checks. However, the brute force test for
unique decipherability would involve infinitely many trials. Luckily, there exist
decidable necessary and sufficient conditions for unique decipherability [16].

Given all this, it is easy to devise an algorithm for learning that uses the size of
the description as a heuristic. The algorithm is given in Figure 4.1.

Algorithm 1

Input: a text T on an alphabet A0
Current output: an alphabet A1 over A0

Let A1 = A0

forall alphabets B over A0

    if |(B, T^B)| < |(A1, T^A1)| then A1 = B.

end forall

Figure 4.1: Universal Enumeration
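The keep-the-best loop of Algorithm 1 can be sketched as a generator that reports its current output after every candidate. The enumeration of alphabets and the description-size function are passed in as parameters because the text leaves their construction to the reader; the function name and the toy data are hypothetical.

```python
def universal_enumeration(candidates, initial, description_size):
    """Anytime sketch of Algorithm 1.

    candidates: an iterable enumerating alphabets (possibly infinite);
    initial: the starting alphabet A0;
    description_size: function returning |(B, T^B)| for an alphabet B.
    Yields the current best alphabet, first for A0 and then after each candidate.
    """
    best, best_size = initial, description_size(initial)
    yield best
    for b in candidates:
        size = description_size(b)
        if size < best_size:
            best, best_size = b, size
        yield best

# Toy run: "alphabets" are just labels with made-up description sizes.
sizes = {"A0": 10.0, "B1": 12.0, "B2": 7.0, "B3": 9.0, "B4": 5.0}
outputs = list(universal_enumeration(["B1", "B2", "B3", "B4"], "A0", sizes.get))
```

Because the generator yields after each step, a caller can stop consuming it at any time and still hold the best alphabet found so far, which mirrors the on-line behaviour described for the enumeration algorithms.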

There could be very ill-natured infinite texts for which this algorithm never
stabilizes. However, at any time, A1 will be the alphabet, among the ones the
enumeration has constructed until then, which yields the minimal description (A1, T^A1).
The reader is urged to pause, consider this algorithm a little more closely, and try
to answer the question: What is its running time?

The way things have been defined makes this question particularly difficult, even
in its formulation. Algorithm 4.2.1 gives an output right away, so the running time
seems to be O(1). However that is not completely true: at the beginning the output
is quite useless but it improves with time. So, how long does it take to give the
best output? Well, we know that there might not exist such a best¹. Moreover, even
if we restrict the inputs to those texts for which a best exists and the algorithm
converges to the best in the limit, still the running time will depend on the input
structure, which is not known until an output is produced. The discussion about
running time looks interesting, but we will postpone it.

¹ It is a consequence of the non-computability of Kolmogorov complexity.

There is a more interesting observation to make about the amount of work
done by this algorithm: whenever the output is changed, the algorithm has done a
maximum^2 amount of work; it has tried everything possible before getting to the
right spot.

If we now consider the behavior of Algorithm 4.2.1, we can't help noticing that
it seems pretty dull. Whenever it finds a new alphabet that looks better than the
previous one, it forgets about the work done before and gets so excited about the
new result that it takes it as the best without giving it a second thought.

Despite  these  drawbacks,  there  is  something  very  good  about  this  algorithm:  in 
the  limit  it  yields  an  optimal  alphabet  and  the  efficiency  of  the  output  is  monoton- 
ically  nondecreasing. 
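The anytime behaviour of Algorithm 4.2.1 can be sketched in Python under strong simplifying assumptions, none of which comes from the thesis: an alphabet is reduced to its code part (a finite tuple of words over the base alphabet), the description size is taken to be the total length of the words plus the number of words needed to spell the text, unique decipherability is not checked, and two hypothetical bounds (`max_word_len`, `max_words`) truncate the enumeration so that the example terminates.

```python
from itertools import combinations, product


def parse_len(text, words):
    """Fewest words whose concatenation spells `text`, or None if impossible."""
    inf = float("inf")
    best = [0] + [inf] * len(text)
    for i in range(len(text)):
        if best[i] == inf:
            continue
        for w in words:
            if text.startswith(w, i):
                j = i + len(w)
                best[j] = min(best[j], best[i] + 1)
    return None if best[-1] == inf else best[-1]


def description_size(text, words):
    """Stand-in for |(B, T^B)|: size of the code plus size of the encoded text."""
    n = parse_len(text, words)
    return None if n is None else sum(len(w) for w in words) + n


def anytime_best(text, base, max_word_len, max_words):
    """Yield (alphabet, size) each time the enumeration finds a strictly better one."""
    pool = ["".join(p) for k in range(1, max_word_len + 1)
            for p in product(base, repeat=k)]
    best = tuple(base)
    best_size = description_size(text, best)
    yield best, best_size  # an output is available right away
    for r in range(1, max_words + 1):
        for candidate in combinations(pool, r):
            size = description_size(text, candidate)
            if size is not None and size < best_size:
                best, best_size = candidate, size
                yield best, best_size
```

Each yielded pair is the best description seen so far, so the sizes form a strictly decreasing sequence; the loop keeps no record of why earlier candidates were rejected, which is the lack of memory the text criticises.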

4.2.2      Hierarchical  Enumeration 

Algorithm 4.2.1 presents all the problems that enumeration algorithms have: it is
infeasible for real applications. As we have noticed, the amount of work that it does
is maximum. So we ask the following question: Is it possible to save it some work?

Fortunately our model is "less abstract" than other models for structures, which
makes the algorithm amenable to improvements. We will be able to do this by
providing our algorithm with a little consciousness about the current best output,
i.e. with a little bit of memory about the work done in the past, and by imposing a
bound on the maximum amount of work performed: instead of enumerating all of
the alphabets, we will limit the enumeration to a small subset of them by putting
a limit on their size. The algorithm is given in Figure 4.2.

^2 Maximum with respect to the enumeration chosen.

Algorithm 2

Input: an alphabet A_0, a text T.
Current output: an alphabet A_1.

Let A_1 = A_0
loop H
    for all alphabets B over A_1 such that |B| ≤ h
        if |(B, T^B)| < |(A_1, T^{A_1})| then A_1 = B.
    end for all
end loop H

Figure 4.2: Hierarchical Enumeration

We should be precise about the algorithmic notation used. The first assumption
is that the "loop H" construct is smart enough to find out if there has been any
progress, i.e. it incorporates a test to find out if $A_1$ has changed since the previous
iteration. Second, we should say something about the assignment $A_1 = B$. Note
that $B$ is an alphabet over $A_1$, but we want $A_1$ to remain flat over $A_0$. So we
assume that the code part of $B$ is a code over the code part of $A_1$ and that the
classification part of $B$ is a classification over the classification part of $A_0$. Moreover,
whenever $s$ is formed as a string of the strings $s_1, \ldots, s_n$, then $\psi(s)$ is the concatenation
$\psi(s_1) \ldots \psi(s_n)$; analogously, if a class $c$ is formed as a class of $c_1, \ldots, c_n \in A_1$, then
$c = \bigcup c_i$. In other words, that assignment forgets the history of how the symbol has
evolved.

We can consider $H$ and $h$ as inner parameters and call Algorithm 4.2.2 $U(h, H)$.
It makes sense now to talk about the running time of $U$, since there are bounds,
namely $H$ and $h$, which guarantee that it will eventually stop. Let us denote its
running time by $f$. It is obvious that $f$ depends on the text $T$ (and on the number $n$ of
symbols of the alphabet of $T$), on the bound $h$ for the enumeration, and on the
number of levels $H$. Thus $f = f(T, h, H)$, and it is easy to see that $f(T, h, 1) = O(2^{n})$
for every $T$.
Let us observe that Algorithm 4.2.1 is the same as $U(\infty, 1)$. Hence it makes sense
to consider $h$ as the accuracy parameter, taken as efficiency of the representation
found, and $H$ as the speed parameter in terms of running time. In fact, the number of
texts for which the most efficient representation is found by $U(h, H)$ grows with $h$.
On the other hand, the maximum size of the alphabet constructed by $U(h, H)$ can
get as large as $h^H$; in fact, the number of leaves of a tree with $H$ generations whose
nodes have $h$ children each can grow as much as $h^H$. In order to understand better
how it is possible to improve speed at the cost of optimality of the solution, we
should compare $U(h, 1)$ with $U(2, \log_2 h)$. The texts for which $U(h, 1)$ will find a
structure while $U(2, \log_2 h)$ will not are all those that show all the possible digrams
with the same frequencies but still have a structure at a higher level.

Let us say that we are not interested in such texts^3; if we had to deal with
them, it would be possible to modify the situation appropriately to overcome the problem,
for instance by dovetailing on $h$. So let us concentrate only on those texts $T$ such
that $U(2, \log_2 h)$ and $U(h, 1)$ yield the same result, i.e. build the same alphabet.

In the light of what has been observed, let us consider the algorithm in Figure 4.3,
with the convention that the inputs to $U(h, H)$ are put in square brackets. Note
that the assignment statement now does not create any problem and that the same
convention as before holds for the loop statement.

Let us call this algorithm $U_2$, again parametrized by $H$ and $L$. Note that the
assumptions that we have made about the texts that we are considering imply
$U(h, 1) = U(2, \log_2 h) = U_2(\log_2 h, 1)$, so $U_2$ is still universal. $U_2$ exhibits a nice
characteristic: when the search in the call to $U$ has stopped at the value $H$ and
no more progress is possible, it allows the process to be repeated starting from the new
alphabet created. The interesting fact is that at any iteration of the outer loop the
new alphabet will be atomic.

Consider  the  following  example: 

Example  26  Let  S  be  the  semigroup  over  the  set  of  strings 

{aaaccc,  bbbccc,  aaabbb,  cccccc} 


^3 Natural language texts don't have this property.
Algorithm 3

Input: an alphabet A_0, a text T on an alphabet A_0
Current output: an alphabet A_L
Variables: a hierarchical family of alphabets {A_i}_{i=0}^{L}

loop L
    A_L = A_{L-1}
    A_L = U(2, H)[A_{L-1}, T^{A_{L-1}}]
end loop L

Figure 4.3: Procrustes New Generation

and T any concatenation of words in S. Then $U(h, 1)$ recognizes the structure of
T completely only if $h \ge 6$, while $U(2, 3)$ will do as well.

Note that the alphabet found by $U_2(1, 3)$ is the same as the one found by $U(2, 3)$,
but $U_2(2, 2)$ finds an alphabet which looks more natural.
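A rough way to see why $U(2, 3)$ can match $U(h, 1)$ in this example, assuming that the bound $h$ limits how many symbols of the previous level may be concatenated into one new symbol (a reading of the size bound that the text does not state explicitly): after $H$ levels a symbol spans at most $h^H$ symbols of $A_0$, so a word of length 6 is within reach only when $h^H \ge 6$.

```latex
\[
  U(h,1):\quad h^{1} \ge 6 \iff h \ge 6,
  \qquad\qquad
  U(2,3):\quad 2^{3} = 8 \ge 6 .
\]
```

This is only a necessary condition on the parameters; whether the search actually reaches the words of $S$ still depends on the text.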


Given all the discussion above, one can take the values that the parameters of the
algorithms must take to learn a text as a characteristic of that text. We should emphasize
that natural texts often show similar characteristics at different levels [11,65]. We
have seen that there are ways to overcome, to a certain extent, the problems of
enumerative learning algorithms, provided the text exhibits nice properties. We can
set forth the conjecture that natural languages are codes that evolved a structure
that makes their learning "easy", i.e. the same convergence is attained with most
search strategies. "Difficult" texts, like the ones structured on random sequences,
are accidents that can happen only with the malicious intervention of a human being
designing encodings ad hoc. It is reasonable to expect that they are not encountered
when natural texts are considered.


Algorithm 4

Input: an alphabet A_0, a text T on an alphabet A_0
Current output: an alphabet A_L
Variables: a hierarchical family of alphabets {A_i}_{i=0}^{L}

loop L
    A_c = A_L = A_{L-1}
    mark all A_c unvisited
    while A_c not all visited
        choose an unvisited s ∈ A_c for which CONDITION holds
        let D be the set of all 2-grams st such that t ∈ A_c
        A_c' = A_c ∪ D − {s}
        if λ_T(A_c) > λ_T(A_c')
            then A_c = A_c'; mark all A_c unvisited
            else mark s visited
    end while
    A_L = A_c
end loop L

Figure 4.4: Heuristic Approach Algorithm

Apart from these considerations on natural languages, we could look more
attentively into the abstract capabilities of Algorithm 4.2.2. It is true that the assumptions
that we made on the nature of the texts that $U_2$ learns look strong at first glance
and are likely to leave out a lot of texts. Nonetheless, we should remember the result
[21] that of all the $n^h$ strings of length $h$ over an alphabet with $n$ symbols, only
$O(h)$ are structured. Informally speaking, structured texts are relatively few.

4.2.3      Heuristic  Approach 

Despite the big improvements on Algorithm 4.2.1, Algorithm 4.2.2 still presents
a certain degree of enumeration. It is impossible to tell a priori the size of
the alphabet that it will be enumerating on, since that is known only at the first
iteration. So, in an actual application, it is probably a good idea to replace
the "forall" statement with a nondeterministic "choose" and then make the choice
on the basis of a priori heuristic judgements. Of course, if experimentation proves it
worthwhile, nothing prevents the heuristic considerations on which these judgements
are based from being turned into proofs by means of a more attentive study.

The algorithm given in Figure 4.4 behaves essentially as Algorithm 4.2.2, but it
avoids the enumeration by constructing, at the provisional level, an alphabet which
will very likely provide a better description, that is, a lower $\lambda$ coefficient.

CONDITION should be set appropriately. Reasonable conditions are, for
instance:

• $p(s)$ is minimum

• $p(s)\,p(t \mid s)$ is maximum

Anyway,  the  correct  choice  for  CONDITION  depends  on  the  nature  of  the  text 
and  can  be  determined  by  experimentation. 
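One pass of the inner loop of the heuristic algorithm can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's procedure: an alphabet is reduced to a set of words, the coefficient compared at each step is replaced by a stand-in `cost` (total word length plus the fewest words needed to spell the text), and CONDITION is modelled by a hypothetical `choose` callback that defaults to picking the lexicographically smallest unvisited symbol.

```python
def min_tokens(text, words):
    """Fewest words whose concatenation spells `text` (infinity if impossible)."""
    inf = float("inf")
    best = [0] + [inf] * len(text)
    for i in range(len(text)):
        if best[i] == inf:
            continue
        for w in words:
            if text.startswith(w, i):
                j = i + len(w)
                best[j] = min(best[j], best[i] + 1)
    return best[-1]


def cost(text, words):
    """Stand-in for the coefficient compared in the algorithm (an assumption)."""
    return sum(len(w) for w in words) + min_tokens(text, words)


def heuristic_level(text, alphabet, choose=min):
    """Replace a chosen symbol s by all 2-grams st while that lowers the cost."""
    current = set(alphabet)
    visited = set()
    while current - visited:
        s = choose(current - visited)
        digrams = {s + t for t in current}
        candidate = (current | digrams) - {s}
        if cost(text, candidate) < cost(text, current):
            current, visited = candidate, set()  # progress: restart the marking
        else:
            visited.add(s)
    return current
```

The loop terminates because every accepted step strictly lowers an integer cost and every rejected step marks one more symbol as visited. Passing a different `choose` function is how one would experiment with the alternative conditions listed above.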

4.3      Open  Problems  and  Future  Work 

Thanks to their simplicity, the algorithms in the previous section can be applied to
many different areas. They can be the object of further study or they can be used
as investigation tools. We will not enumerate all the possible ways in which future
research on the subject may proceed.

However, there is another problem, not immediately evident, to which
we would like to draw attention.
4.3.1      A Characterization of Natural Languages

Let $T$ be a text over $A_0$, $A$ the code constructed by Algorithm 4.2.1, and $\{A^h_i\}_{i=1}^{n}$
the hierarchy of codes constructed by Algorithm 4.2.2 when the search is limited
only to codes. Define $T_i$ recursively as follows:

$$T_0 = T, \qquad T_i = T_{i-1}^{A^h_i}$$

Let us make the additional assumption that $T$ is such that $T_i$ is much bigger
than $A^h_i$, i.e. $|T_i| \gg |A^h_i|$.

In this situation the only contribution to the size of the description $(A^h_i, T_i)$ is
given by $|T_{i-1}|\,\lambda_{T_{i-1}}(A^h_i)$.

Consider again the hierarchy $\{A^h_i = (X_i, \psi_i, A^h_{i-1})\}_{i=1}^{n}$. Let $\psi_i(x) = s_1 \ldots s_k$ and
define new codes $\{A'^{h}_i = (X_i, \psi'_i, A_0)\}_{i=1}^{n}$ recursively in the following way:

$$\psi'_1(x) = \psi_1(x)$$

$$\psi'_i(x) = \psi'_{i-1}(s_1) \cdots \psi'_{i-1}(s_k)$$
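The recursive flattening of a hierarchy of codes into codes over the base alphabet can be illustrated with a short Python sketch. The representation is an assumption made for the example: each level is a dictionary mapping a symbol to the tuple of lower-level symbols it stands for, and level 1 maps symbols to tuples of base characters. The name `flatten` and the sample symbols are hypothetical.

```python
def flatten(hierarchy):
    """Turn per-level codes into codes over the base alphabet.

    hierarchy[0] maps each level-1 symbol to a tuple of base characters;
    hierarchy[i] maps each level-(i+1) symbol to a tuple of level-i symbols.
    Returns one dict per level, mapping each symbol to a base-alphabet string.
    """
    flat = []
    for level, psi in enumerate(hierarchy):
        if level == 0:
            # Base case: the first level is already written over the base alphabet.
            flat.append({x: "".join(parts) for x, parts in psi.items()})
        else:
            # Recursive case: concatenate the flattened images of the parts.
            prev = flat[-1]
            flat.append({x: "".join(prev[s] for s in parts)
                         for x, parts in psi.items()})
    return flat


levels = [
    {"X": ("a", "a"), "Y": ("c", "c"), "a": ("a",), "c": ("c",)},
    {"P": ("X", "a"), "Q": ("Y", "c")},
    {"W": ("P", "Q")},
]
flat = flatten(levels)
```

Here the top-level symbol `W` flattens to the base-alphabet word `aaaccc`, built in three binary steps; the flattened code keeps no trace of the intermediate symbols.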

It is an interesting problem to characterize the texts $T$ for which there exists a
number $h$ such that

$$A'^{h}_n = A$$

For such texts the search for the optimal description would be linear in the number
of levels, if $h$ is known. Notice that natural languages seem to have a hierarchical
organization where $h$ is small (about 5).

Another interesting problem is to see whether there exists an optimal value of
$h$ which would ensure the right convergence of Algorithm 4.2.2 for all
structured texts^4, except possibly a "small" set of texts.


^4 See Introduction for an explanation of what "structured" means.